Distribution simulation

by RS admin@robinsnyder.com

Simulating a known distribution can be important.

This page looks at how one can simulate a distribution.

For simplicity, let us use the following distribution. At one time, the typical distribution of a plain package of M&M's was as follows.

Brown - 30%
Yellow - 20%
Red - 20%
Orange - 10%
Green - 10%
Blue - 10%

Try performing an Internet search to see how the distributions have changed over time.

See, for example, https://qz.com/918008/the-color-distribution-of-mms-as-determined-by-a-phd-in-statistics/ (as of 2020-04-17).

The process of binning provides a set of values and a count of those values in each bin. Let us take the above distribution percentages and operationalize them as binned counts, scaled to a total of 20 items.

To see how to take raw data and summarize it in binned form, see Summarizing data: The M&M Problem.

Here is the example binned data.

colorDict1 = {
	"brown" : 6,
	"yellow" : 4,
	"red" : 4,
	"orange" : 2,
	"green" : 2,
	"blue" : 2,
	}
colorLen1 = len(colorDict1)

The total of the counts is needed.

One way to get the total count of the items in the (binned) dictionary is to use the reduce function (from the functools module).

total1 = reduce(lambda x,key1: x + colorDict1[key1],colorDict1,0)

The first argument to reduce is the lambda function that takes the cumulative value x and the key key1. The second argument is the dictionary itself; iterating over a dictionary yields its keys, which is why the lambda function looks up colorDict1[key1].

The third argument to reduce is the starting value 0 for the reduction, since 0 is the identity element for addition.
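As a cross-check on the reduce call, the same total can be obtained with the built-in sum function applied to the dictionary values. This is only a minimal sketch; the name total2 is introduced here for illustration and is not part of the original code.

```python
from functools import reduce

colorDict1 = {"brown": 6, "yellow": 4, "red": 4, "orange": 2, "green": 2, "blue": 2}

# reduce over the keys, starting from the additive identity 0
total1 = reduce(lambda x, key1: x + colorDict1[key1], colorDict1, 0)

# total2 is a hypothetical name: built-in sum over the counts
total2 = sum(colorDict1.values())

print("reduce: {0:d}, sum: {1:d}".format(total1, total2))
```

Both approaches visit every count once; sum is shorter, while reduce generalizes to operations other than addition.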

For each random selection, the total count of items, 20 (above), and the count of each individual item need to be known. A random number greater than or equal to 0.0 and less than 1.0 is drawn.

Then one goes through the list, accumulating the counts, until the cumulative fraction is greater than or equal to the random number.

Here is a table of individual and cumulative contributions.

Color    Count  Fraction  Cumulative count  Cumulative fraction
brown    6      0.300      6                0.300
yellow   4      0.200     10                0.500
red      4      0.200     14                0.700
orange   2      0.100     16                0.800
green    2      0.100     18                0.900
blue     2      0.100     20                1.000
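The individual and cumulative contributions can be computed rather than worked out by hand. The following sketch assumes Python 3.7 or later, where a dictionary keeps its insertion order; the name cumCounts1 is introduced here for illustration only.

```python
from itertools import accumulate

colorDict1 = {"brown": 6, "yellow": 4, "red": 4, "orange": 2, "green": 2, "blue": 2}
total1 = sum(colorDict1.values())

# running totals of the counts, in dictionary (insertion) order
cumCounts1 = list(accumulate(colorDict1.values()))

# one row per color: count, fraction, cumulative count, cumulative fraction
for (color1, count1), cum1 in zip(colorDict1.items(), cumCounts1):
    print("{0:8s} {1:2d} {2:.3f} {3:3d} {4:.3f}".format(
        color1, count1, count1 / total1, cum1, cum1 / total1))
```

The last cumulative count equals the total, so the last cumulative fraction is 1.0.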

Here is the code to go through the dictionary of counts.

r1 = rand1.random()
sum1 = 0
for color1 in colorDict1:
	count1 = colorDict1[color1]
	sum1 += count1
	pct1 = float(sum1) / float(total1)
	if pct1 >= r1:
		break

The color color1 at the exit of the for loop is the color selected.

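A technical point: The random number generated using rand1.random() is greater than or equal to 0.0 but less than 1.0. Since the last cumulative fraction is 1.0, the comparison pct1 >= r1 must succeed at the last item at the latest. Thus, the for or while loop used will always exit with a break and never run to completion.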
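The boundary behavior can be checked directly. The sketch below wraps the selection loop in a hypothetical helper function, pickColor1 (not part of the original code), and uses the else part of the for loop to flag the case where no break happens.

```python
import random

colorDict1 = {"brown": 6, "yellow": 4, "red": 4, "orange": 2, "green": 2, "blue": 2}
total1 = sum(colorDict1.values())

def pickColor1(r1):
    """Return the first color whose cumulative fraction is >= r1."""
    sum1 = 0
    for color1 in colorDict1:
        sum1 += colorDict1[color1]
        if float(sum1) / float(total1) >= r1:
            break
    else:
        # reached only if the loop finished without a break
        raise ValueError("no color selected for r1 = {0}".format(r1))
    return color1

rand1 = random.Random(1)
print(pickColor1(rand1.random()))   # a normal draw
print(pickColor1(0.0))              # lowest possible draw
```

For any value that random() can return, the else part is never reached; it only triggers for an out-of-range value such as 1.5.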

Otherwise, this is one place where one might use the else part of a for or while loop, which runs only when the loop ends without a break, to handle that case.

Here is the Python code [#4]

import random
from functools import reduce
colorDict1 = {
	"brown" : 6,
	"yellow" : 4,
	"red" : 4,
	"orange" : 2,
	"green" : 2,
	"blue" : 2,
	}
colorLen1 = len(colorDict1)
total1 = reduce(lambda x,key1: x + colorDict1[key1],colorDict1,0)
print("Total items: {0:d}".format(total1))
mlen1 = 8
slen1 = 6
print("")
print("Simulate {0:d} items , {1:d} times".format(mlen1,slen1))
for spos1 in range(1,slen1+1):
	seed1 = spos1
	rand1 = random.Random(seed1)
	mList1 = []
	for mpos1 in range(1,mlen1+1):
		# get random number r1 with 0.0 <= r1 < 1.0
		r1 = rand1.random()
		sum1 = 0
		for color1 in colorDict1:
			count1 = colorDict1[color1]
			sum1 += count1
			pct1 = float(sum1) / float(total1)
			if pct1 >= r1:
				break
		mList1.append(color1)
	mText1 = " ".join(mList1)
	print("")
	print("{0:d}. {1:s}".format(spos1,mText1))

Here is the output of the Python code.

Total items: 20
Simulate 8 items , 6 times
1. brown green orange brown yellow yellow red orange
2. blue blue brown brown green orange red yellow
3. brown red yellow red red brown brown green
4. brown brown yellow brown brown yellow blue green
5. red orange orange blue orange blue brown yellow
6. orange green yellow brown brown red yellow orange

Note that going through the dictionary many times can be inefficient. But the dictionary approach can be useful if there are not too many items in the samples and data sets and one wants to make it easy to have a very dynamic way to generate samples from data sets.
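One way to avoid walking the dictionary for every draw, while staying in plain Python, is to compute the cumulative fractions once and then search them. The sketch below uses bisect_left from the standard bisect module; the names cumPct1 and pickColor2 are introduced here for illustration and are not part of the original code.

```python
import random
from bisect import bisect_left
from itertools import accumulate

colorDict1 = {"brown": 6, "yellow": 4, "red": 4, "orange": 2, "green": 2, "blue": 2}
total1 = sum(colorDict1.values())
keyList1 = list(colorDict1.keys())

# cumulative fractions, computed once; the last entry is 1.0
cumPct1 = [float(c) / float(total1) for c in accumulate(colorDict1.values())]

def pickColor2(r1):
    # index of the first cumulative fraction that is >= r1
    return keyList1[bisect_left(cumPct1, r1)]

rand1 = random.Random(1)
mList1 = [pickColor2(rand1.random()) for mpos1 in range(8)]
print(" ".join(mList1))
```

Because bisect_left finds the first entry greater than or equal to r1, the selection rule is the same as in the loop above, only found by binary search. The standard library function random.choices, which accepts a weights argument, is another option when the exact selection rule does not matter.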

One way to achieve some machine efficiency is to convert the Python dictionary to a NumPy array.

Note that such efficiency only becomes measurably apparent as the size of the dictionary increases and/or the number of simulated samples increases.
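To illustrate where the array form can pay off, the following sketch draws many samples in one step using np.cumsum and np.searchsorted. It is only one possible approach, not the code developed on this page: it assumes NumPy's own random generator (np.random.default_rng), so the draws differ from those of random.Random, and the names cumPct1, rng1, rArray1, posArray1 and mArray1 are introduced here for illustration.

```python
import numpy as np

colorDict1 = {"brown": 6, "yellow": 4, "red": 4, "orange": 2, "green": 2, "blue": 2}
keyArray1 = np.array(list(colorDict1.keys()))
valueArray1 = np.array(list(colorDict1.values()))

# cumulative fractions of the counts; the last entry is 1.0
cumPct1 = np.cumsum(valueArray1) / valueArray1.sum()

# assumption of this sketch: NumPy's generator instead of random.Random
rng1 = np.random.default_rng(1)
rArray1 = rng1.random(1000)   # 1000 draws in [0.0, 1.0)

# for each draw, the index of the first cumulative fraction >= the draw
posArray1 = np.searchsorted(cumPct1, rArray1, side="left")
mArray1 = keyArray1[posArray1]

print(" ".join(mArray1[:8]))
```

The selection rule is unchanged (first cumulative fraction greater than or equal to the random number), but the search over all 1000 draws happens inside NumPy rather than in a Python loop.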

Note that the following code shows more than is strictly needed, in order to demonstrate various ways of processing and transforming dictionaries and lists.

Here is the code to show the Python dictionary.

print("Python dictionary:")
print("")
print("{0:s}".format(str(colorDict1)))

Here is the code to convert and show the Python lists from a Python dictionary.

colorList1 = list(colorDict1.items())
keyList1 = list(colorDict1.keys())
valueList1 = list(colorDict1.values())
print("")
print("Python list from Python dictionary:")
print("")
print("Both: {0:s}".format(str(colorList1)))
print("Keys: {0:s}".format(str(keyList1)))
print("Values: {0:s}".format(str(valueList1)))

Here is the code to create and show the NumPy array from a Python list.

colorArray1 = np.array(colorList1)
keyArray1 = np.array(keyList1)
valueArray1 = np.array(valueList1)
print("")
print("NumPy arrays from Python lists:")
print("")
print("Both: {0:s}".format(str(colorArray1)))
print("")
print("Keys: {0:s}".format(str(keyArray1)))
print("")
print("Values: {0:s}".format(str(valueArray1)))
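One detail worth noting about the conversion: a NumPy array holds a single element type, so an array built from (key, count) pairs stores the counts as text, while an array built from the counts alone stores integers. The following sketch, using a shortened three-color list for brevity, shows the difference.

```python
import numpy as np

# shortened list of (key, count) pairs, for illustration only
colorList1 = [("brown", 6), ("yellow", 4), ("red", 4)]
colorArray1 = np.array(colorList1)
valueArray1 = np.array([count1 for (color1, count1) in colorList1])

print(colorArray1.dtype.kind)   # "U" means Unicode string
print(valueArray1.dtype.kind)   # "i" means signed integer

# a count taken from the combined array is text and must be converted back
print(int(colorArray1[0][1]) + 1)
```

This is why the simulation works with separate key and value arrays rather than with the combined array.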

The distribution simulation code above is then added, but using the NumPy array rather than the Python dictionary.

Here is the code to go through the array of counts in valueArray1 to simulate the samples.

r1 = rand1.random()
sum1 = 0
colorPos1 = 0
while True:
	count1 = valueArray1[colorPos1]
	sum1 += count1
	pct1 = float(sum1) / float(total1)
	if pct1 >= r1:
		break
	colorPos1 += 1

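The position colorPos1 at the exit of the while loop is the index of the color selected; the color itself is keyArray1[colorPos1].

Note the use of the potentially infinite while True loop. If the code is correct, the loop will always break, since the last cumulative fraction is 1.0 and the random number is less than 1.0.

Here is the Python code [#9]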

import numpy as np
import random
from functools import reduce
colorDict1 = {
	"brown" : 6,
	"yellow" : 4,
	"red" : 4,
	"orange" : 2,
	"green" : 2,
	"blue" : 2,
	}
colorLen1 = len(colorDict1)
print("Python dictionary:")
print("")
print("{0:s}".format(str(colorDict1)))
colorList1 = list(colorDict1.items())
keyList1 = list(colorDict1.keys())
valueList1 = list(colorDict1.values())
print("")
print("Python list from Python dictionary:")
print("")
print("Both: {0:s}".format(str(colorList1)))
print("Keys: {0:s}".format(str(keyList1)))
print("Values: {0:s}".format(str(valueList1)))
colorArray1 = np.array(colorList1)
keyArray1 = np.array(keyList1)
valueArray1 = np.array(valueList1)
print("")
print("NumPy arrays from Python lists:")
print("")
print("Both: {0:s}".format(str(colorArray1)))
print("")
print("Keys: {0:s}".format(str(keyArray1)))
print("")
print("Values: {0:s}".format(str(valueArray1)))
total1 = reduce(lambda x,key1: x + colorDict1[key1],colorDict1,0)
print("")
print("Total items: {0:d}".format(total1))
mlen1 = 8
slen1 = 6
print("")
print("Simulate {0:d} items , {1:d} times".format(mlen1,slen1))
for spos1 in range(1,slen1+1):
	seed1 = spos1
	rand1 = random.Random(seed1)
	mList1 = []
	for mpos1 in range(1,mlen1+1):
		# get random number r1 with 0.0 <= r1 < 1.0
		r1 = rand1.random()
		sum1 = 0
		colorPos1 = 0
		while True:
			count1 = valueArray1[colorPos1]
			sum1 += count1
			pct1 = float(sum1) / float(total1)
			if pct1 >= r1:
				break
			colorPos1 += 1
		mList1.append(keyArray1[colorPos1])
	mText1 = " ".join(mList1)
	print("")
	print("{0:d}. {1:s}".format(spos1,mText1))

Here is the output of the Python code.

Python dictionary:
{'brown': 6, 'yellow': 4, 'red': 4, 'orange': 2, 'green': 2, 'blue': 2}
Python list from Python dictionary:
Both: [('brown', 6), ('yellow', 4), ('red', 4), ('orange', 2), ('green', 2), ('blue', 2)]
Keys: ['brown', 'yellow', 'red', 'orange', 'green', 'blue']
Values: [6, 4, 4, 2, 2, 2]
NumPy arrays from Python lists:
Both: [['brown' '6']
 ['yellow' '4']
 ['red' '4']
 ['orange' '2']
 ['green' '2']
 ['blue' '2']]
Keys: ['brown' 'yellow' 'red' 'orange' 'green' 'blue']
Values: [6 4 4 2 2 2]
Total items: 20
Simulate 8 items , 6 times
1. brown green orange brown yellow yellow red orange
2. blue blue brown brown green orange red yellow
3. brown red yellow red red brown brown green
4. brown brown yellow brown brown yellow blue green
5. red orange orange blue orange blue brown yellow
6. orange green yellow brown brown red yellow orange

Note that what appears to be a list structure in NumPy is actually an array but displayed using square brackets. For large arrays, NumPy will attempt to display the first and last few elements and use some shortcuts to make the text more compact.
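The display behavior can be seen with a short sketch. The array names smallArray1 and largeArray1 are introduced here for illustration, and the size 2000 is chosen only because it is above NumPy's default threshold for abbreviating output.

```python
import numpy as np

smallArray1 = np.arange(6)      # 0 through 5
largeArray1 = np.arange(2000)   # 0 through 1999

print(str(smallArray1))             # every element is shown
print(str(largeArray1))             # middle elements are replaced by "..."
print(type(smallArray1).__name__)   # the type is ndarray, not list
```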
