Forum for discussion of mid, final term and final project

Number of bits of the identifiers in input

Number of bits of the identifiers in input

by ALESSANDRO BOCCI -
Number of replies: 14

Hello,

I have a couple of questions reguarding the number of bits of the identifiers to give in input to the simulation: if we must use SHA1 for node identifiers and keys, considering that the number of bits output of SHA1 is 160, how can we use the number in input if it's different from 160?

Furthermore, if we use 160 bits for the identifiers, the coordinator could have some troubles to work: with a low number of nodes the finger tables will have few different entries and the routing will be similar to navigate a linked list; with a huge number of nodes the coordinator will have a big burden generating and writing on file the finger tables, and the simulation take a lot of time. How can we deal with that?

In reply to ALESSANDRO BOCCI

Re: Number of bits of the identifiers in input

by ALESSANDRO PAGIARO -

I think that we can apply the % (modulo) operator to the 160-bit address to resize it. 


In reply to ALESSANDRO PAGIARO

Re: Number of bits of the identifiers in input

by ALESSANDRO BOCCI -
Yes, but with modulo there are collisions and we have not a politic to manage them. Plus we can use modulo without SHA1 and have the same behaviour, and I understood that it's not what they want us to do. Maybe I misunderstood something...
Edit: I answerd to ALESSANDRO PAGIARO :)
In reply to ALESSANDRO BOCCI

Re: Number of bits of the identifiers in input

by MARCO SPINOSA -

No, don’t do the modulo. It doesn’t preserve the uniform distribution of the nodes and it may lead to wrong analysis.

According to https://crypto.stackexchange.com/questions/9435/is-truncating-a-sha512-hash-to-the-first-160-bits-as-secure-as-using-sha1 it is safe to truncate the SHA (I did some tests on it and it works).

I’m currently using the SHA512, truncating it to 1, 2… characters to generate rings of 16, 64…identifiers. The results of the analysis are the same of the tests in the paper.

I wrote an email to the teacher asking to discuss this solution to be sure that it is correct, but no answer so far.


In reply to MARCO SPINOSA

Re: Number of bits of the identifiers in input

by DAVIDE RUCCI -

Maybe I am missing something, but on the slides about Chord it is written clearly that all the arithmetic operations are done modulo 2^m so I personally don’t see the point in not using the modulo... (how to do the modulo with 160 bits is another issue)

In reply to DAVIDE RUCCI

Re: Number of bits of the identifiers in input

by MARCO SPINOSA -

I mean using the modulo to shrink the identifiers’ space. All the operations are modulo 2160 in Chord, but if you want smaller identifiers it is better to truncate the SHA rather than do the modulo of the integer representing the SHA.


e.g. Let’s suppose you want a ring with 8bit identifiers (at most 256 nodes). 

Using the modulo, you can do Int(SHA)%28. By doing this you can generate (for instance) 20 nodes’ identifiers, but it is not guaranteed that those identifiers are uniformly distributed in [0, 256].

If you truncate the SHA first, and then converts it into Int (i.e. Int(SHA.firstTwoChars()))*, the generated identifiers are uniformly distributed in [0, 256] (at least more uniformly distributed than using the modulo as shown in the picture attached).


*I’m assuming you’re using a hex representation, so each digit is in [0,16] = 4bit


Attachment Int(sha)% vs Int(trunc(sha)).png
In reply to MARCO SPINOSA

Re: Number of bits of the identifiers in input

by DAVIDE RUCCI -
But I guess this is because of the truncation to 32/64 bits, isn’t it? Put in another way, if we were able to manage all the 160 bits and do the modulo operation over them this would not cause the skew in the distribution, is this correct?
In reply to DAVIDE RUCCI

Re: Number of bits of the identifiers in input

by MARCO SPINOSA -

Yes, it is due to the reduction done by the modulo. If you use all 160 bit there is no skew in the distribution because all the numbers are generated between 0 and 2^160 and doing the module (%2^160) has no effect.

https://stackoverflow.com/questions/10984974/why-do-people-say-there-is-modulo-bias-when-using-a-random-number-generator here there is a simple example of what happens if you use the modulo to shrink an uniform distribution into a smaller range



In reply to MARCO SPINOSA

Re: Number of bits of the identifiers in input

by MICHELE MAZZONI -

Note that truncating the hash to the first m bits is the same as computing the hash modulo 2^m, so if you have identifiers of 8 bits computing SHA mod 256 is equivalent to taking only its first 8 bits, i.e.:

da39a3ee5e6b4b0d3255bfef95601890afd80709 mod 100 = 09

In reply to MICHELE MAZZONI

Re: Number of bits of the identifiers in input

by ORLANDO LEOMBRUNI -

Isn't truncating the hash to the first m bits actually the same as dividing by 2^m?

10111 mod 100 = 11
10111 div 100 = 101

So you're taking the last (i.e. least significative) m bits, not the first.

In reply to MICHELE MAZZONI

Re: Number of bits of the identifiers in input

by MARCO SPINOSA -

/* I'm sorry, I didn't refresh the page, so I didn't see Orlando's answer */


You’re right, but it’s not what I wrote.


Let’s consider this string: 192.168.1.198:64701

Its SHA512 is dbe64da677c59907c8bd4bc47f351b0684bb6c784da8b0084dc783fec296589275be60b1d5a9e2d51ee13ca4986850dd19540cdf359a0dc441fee9c91b04ab25

that corresponds to  11517077735634809028583328739248267125354759364135388199387752889595252297855399413735965976687147922166908471966158523467663989006762635142132461247703845


let’s suppose you want a ring of 16 identifiers (4 bit)

To represent 16 identifiers in hex you need a single digit, so truncate the SHA at the first digit

SHA:d -> Int:13


If you do the modulo you get 5 (as you pointed out).


Now, I don’t think it is matter of taking the most or the least significative bits but, from the test I did, truncating the SHA results in a better distribution. I'm surprised too that there is a difference (I’m just trusting stack overflow), and this is the reason why I wrote an email to the teacher.


In reply to MARCO SPINOSA

Re: Number of bits of the identifiers in input

by MICHELE MAZZONI -
Yes sorry, my fault as I didn't check the first SF link and I thought that "first" was referring to the least significant bits.

Anyway, if you are referring to the first answer to this question https://stackoverflow.com/questions/10984974/why-do-people-say-there-is-modulo-bias-when-using-a-random-number-generator, it can't be the reason for skewness in the distribution because we are in the case in which every equivalence class modulo 2^M has the same cardinality, that is applying the modulo every number has the same probability of being generated, this is also stated in the SF answer: "So when does rand()%n return a range of numbers from 0 to n-1 with equal probability? When RAND_MAX%n == n - 1", in our setting the condition is (2^160 - 1) mod 2^m = 2^m - 1, and it is true, i.e. 2^160 - 1 mod 2^8 = 255 = 2^8 - 1.

What do you mean by "I don’t think it is matter of taking the most or the least significative bits"? That is the point as you are taking the 'm' most significant bits whereas the modulo takes the 'm' least significant bits.
I tried a few tests and the and I don't see any sensible difference between the two distributions , also increasing the number of tries (N) flattens the distribution as one would expect for a uniform distribution, you can check the results and the code here: http://nbviewer.jupyter.org/gist/micmn/ad18e3dd2e183a5afeabf2b249b0a5ab
In reply to MICHELE MAZZONI

Re: Number of bits of the identifiers in input

by MARCO SPINOSA -

Yes, you’re right, I’ve checked that statement after the answer. I think it is the reason why truncate the SHA works: “it enforces you to do the modulo using a power of two”.

Anyway, the question was if doing the modulo is a right way of shrinking the interval, and in general it is not. If you chose a “wrong” number (even a prime as usual) you introduce a skew in the distribution.


With "I don’t think it is matter of taking the most or the least significative bits" I meant that by how the SHA is built, in my opinion I can take a group of bits in any position with no difference.


I checked your code and it seems correct. After the mathematic demonstration, sincerely, I don’t know why seems that a difference exists. Maybe there is a difference in the “speed” of generating the numbers in the interval (I tried your code with N < 2m, since you can’t generate more nodes than 2m.) Did you performed other tests?


In reply to MARCO SPINOSA

Re: Number of bits of the identifiers in input

by Laura Ricci -

First of all, congrats, guys, for your efforts in finding a solution, it's the first time I see a course forum really working. Doing the modulo is a  not the right way of shrinking the interval because you can lose the uniformity of the distribution. In any case, doing the modulo corresponds to considering the least significative bits. On the other way, I understand you have several difficulties with working with very large numbers, doing operations with BigInt and so on, may bring to very long computations, so computing modulo is mandatory. I have checked the statement in https://stackoverflow.com/questions/10984974/why-do-people-say-there-is-modulo-bias-when-using-a-random-number-generator, and it really seems to me the solution to our problem. Anyway, we can speak about it today afternoon.