I have encountered an interesting random sampling problem:
Generate n integers that sum up to a fixed number. Ideally, the n numbers should be close to the mean of these numbers within a ~25% deviation.
My first thought was sampling a Gaussian distribution. The problem was that mean and deviation won't always be integers, thus I'll need to trunk the fractions and later looping all the generated numbers to fix them to the sum.
It seemed an easy problem at the beginning, but the tricky thing here is the integers. I browsed around on Dirichlet distribution (sum to one but fraction numbers), integer programming (while the feasibility is quite low) and landed on Multinomial distribution.
Method One: Multinomial Distribution
After a bit of research, I found a very handy sampling method using multinomial distribution in numpy. It ensures the sum of the numbers, and generates integers. The python implementation is as follows:
>>> import numpy as np
>>> _sum = 800
>>> n = 20
>>> rnd_array = np.random.multinomial(_sum, np.ones(n)/n, size=1)[0]
>>> print rnd_array
[28 38 50 32 40 43 39 33 40 28 46 41 55 44 36 28 45 35 49 50]
The multinomial
random generator in numpy is taking 3 parameters: 1) number of experiments (as throwing a dice), which would be the sum of numbers here; 2) array of n probabilities for each of the i-th position, and I assigned equal probabilities here since I want all the numbers to be closer to the mean; 3) shape of vector.
This method is almost working except for that the deviation cannot be controlled strictly within the desired range. In my case, it satisfied what I want mostly, if not, I'll give it a few more trial runs.
Method Two: Random Integer Generator Within Lower and Upper Bounds
Inspired by the multinomial distribution, I experimented with another method for generating random integers within lower and upper bounds. It starts from an array of lower bounds and adding ones to random positions of the array until the sum requirement is met. The implementation is as followers:
def generate_random_integers(_sum, n):
mean = _sum / n
variance = int(0.25 * mean)
min_v = mean - variance
max_v = mean + variance
array = [min_v] * n
diff = _sum - min_v * n
while diff > 0:
a = random.randint(0, n - 1)
if array[a] >= max_v:
continue
array[a] += 1
diff -= 1
print array
Given the same _sum = 800 and n = 20, the example result is as shown:
[41, 40, 47, 34, 46, 41, 36, 39, 38, 43, 42, 43, 41, 38, 40, 34, 38, 43, 37, 39]