Python: NLTK and bi-chars

The hard-coded text used is part of the Declaration of Independence.

textList1 = [
	"We hold these truths to be self-evident, that all men are created equal,",
	"that they are endowed by their Creator with certain unalienable Rights,",
	"that among these are Life, Liberty, and the pursuit of Happiness.",
	"That to secure these rights, Governments are instituted among Men,",
	"deriving their just powers from the consent of the governed.",
	"That whenever any Form of Government becomes destructive of these ends,",
	"it is the Right of the People to alter or to abolish it, and to institute new Government,",
	"having its foundation on such principles and organizing its powers in such form,",
	"as to them shall seem most likely to effect their Safety and Happiness.",
	]

The lines in the list are joined with spaces into a single string, which is then converted to lower case.

textText1 = " ".join(textList1)
textText1 = textText1.lower()

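A regular expression can be used to extract the words from the text. Here a word is any run of letters and apostrophes, so a hyphenated word such as "self-evident" is split into two words.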

wordList1 = re.findall("[A-Za-z']+",textText1)
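
As a small illustration of what the pattern matches, the sketch below applies it to one lowercased line of the text. The names sampleText1 and sampleWords1 are illustrative and are not part of the original program; since the text is already lower case, the pattern is written with a-z only.

```python
import re

# One line of the text, lowercased as in the main program.
sampleText1 = "We hold these truths to be self-evident, that all men are created equal,".lower()

# Runs of letters and apostrophes are words; everything else separates words.
sampleWords1 = re.findall(r"[a-z']+", sampleText1)

print(sampleWords1)
```

Because the hyphen is not in the character class, "self" and "evident" come out as separate words.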

The number of times each word occurs can be obtained with FreqDist from the nltk package, as follows.

Note that these counts are not hard to program explicitly, if needed.

freqDict1 = nltk.FreqDist(wordList1)

The result behaves like a dictionary in which each key is a word and each value is the number of times that word occurs.
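
As noted above, the count is not hard to program explicitly. The following is a minimal sketch that uses a plain dictionary instead of nltk; the name countDict1 and the shortened sample text are illustrative and are not part of the original program.

```python
import re

# A shortened, already lowercased piece of the text.
textText1 = ("that to secure these rights, governments are instituted among men, "
             "deriving their just powers from the consent of the governed.")
wordList1 = re.findall(r"[a-z']+", textText1)

# Explicit count: the key is the word, the value is the number of occurrences.
countDict1 = {}
for word1 in wordList1:
    countDict1[word1] = countDict1.get(word1, 0) + 1

for word1, count1 in countDict1.items():
    print("\t{0:d} \"{1:s}\"".format(count1, word1))
```

The standard library class collections.Counter does the same job in one call, Counter(wordList1).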

The distribution can then be printed.

print("Frequency distribution:")
print("")
for word1,count1 in freqDict1.items():
	print("\t{0:d} \"{1:s}\"".format(count1,word1))

Here is the Python code [#6]

import re
import nltk
textList1 = [
	"We hold these truths to be self-evident, that all men are created equal,",
	"that they are endowed by their Creator with certain unalienable Rights,",
	"that among these are Life, Liberty, and the pursuit of Happiness.",
	"That to secure these rights, Governments are instituted among Men,",
	"deriving their just powers from the consent of the governed.",
	"That whenever any Form of Government becomes destructive of these ends,",
	"it is the Right of the People to alter or to abolish it, and to institute new Government,",
	"having its foundation on such principles and organizing its powers in such form,",
	"as to them shall seem most likely to effect their Safety and Happiness.",
	]
textText1 = " ".join(textList1)
textText1 = textText1.lower()
wordList1 = re.findall("[A-Za-z']+",textText1)
freqDict1 = nltk.FreqDist(wordList1)
print("Frequency distribution:")
print("")
for word1,count1 in freqDict1.items():
	print("\t{0:d} \"{1:s}\"".format(count1,word1))

Here is the output of the Python code.

Frequency distribution:
	1 "we"
	1 "hold"
	4 "these"
	1 "truths"
	7 "to"
	1 "be"
	1 "self"
	1 "evident"
	5 "that"
	1 "all"
	2 "men"
	4 "are"
	1 "created"
	1 "equal"
	1 "they"
	1 "endowed"
	1 "by"
	3 "their"
	1 "creator"
	1 "with"
	1 "certain"
	1 "unalienable"
	2 "rights"
	2 "among"
	1 "life"
	1 "liberty"
	4 "and"
	5 "the"
	1 "pursuit"
	5 "of"
	2 "happiness"
	1 "secure"
	1 "governments"
	1 "instituted"
	1 "deriving"
	1 "just"
	2 "powers"
	1 "from"
	1 "consent"
	1 "governed"
	1 "whenever"
	1 "any"
	2 "form"
	2 "government"
	1 "becomes"
	1 "destructive"
	1 "ends"
	2 "it"
	1 "is"
	1 "right"
	1 "people"
	1 "alter"
	1 "or"
	1 "abolish"
	1 "institute"
	1 "new"
	1 "having"
	2 "its"
	1 "foundation"
	1 "on"
	2 "such"
	1 "principles"
	1 "organizing"
	1 "in"
	1 "as"
	1 "them"
	1 "shall"
	1 "seem"
	1 "most"
	1 "likely"
	1 "effect"
	1 "safety"
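
The output above lists the words in order of first appearance. To list the most frequent words first, the (word, count) pairs can be sorted by count. The sketch below uses a small hand-made dictionary, pairDict1, with a few counts copied from the output above in place of freqDict1; the names pairDict1 and pairList1 are illustrative.

```python
from operator import itemgetter

# A few (word, count) pairs copied from the output above.
pairDict1 = {"we": 1, "these": 4, "to": 7, "that": 5, "men": 2}

# Sort by count (item 1 of each pair), largest first.
pairList1 = sorted(pairDict1.items(), key=itemgetter(1), reverse=True)

for word1, count1 in pairList1:
    print("\t{0:d} \"{1:s}\"".format(count1, word1))
```

In recent NLTK versions FreqDist is built on collections.Counter, so freqDict1.most_common() should give a similarly ordered list.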

Bi-grams (pairs of adjacent items) are often used at the word level. Here, they are used at the character level, as bi-chars: pairs of adjacent characters.

Claude Shannon applied this approach by hand, since computers like those we have today did not exist at the time.
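
To make the difference between the two levels concrete, the short sketch below builds both kinds of bi-grams from the first three words of the text. The names wordGrams1 and charGrams1 are illustrative and are not used in the program that follows.

```python
text1 = "we hold these"

# Word-level bi-grams: pairs of adjacent words.
wordList1 = text1.split()
wordGrams1 = list(zip(wordList1, wordList1[1:]))

# Character-level bi-grams (bi-chars): pairs of adjacent characters, spaces included.
charGrams1 = list(zip(text1, text1[1:]))

print(wordGrams1)
print(charGrams1[:4])
```

A text of n characters has n-1 bi-chars, which is the total that the program below reports.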

Here is the Python code [#7]

import random
textList1 = [
	"We hold these truths to be self-evident, that all men are created equal,",
	"that they are endowed by their Creator with certain unalienable Rights,",
	"that among these are Life, Liberty, and the pursuit of Happiness.",
	"That to secure these rights, Governments are instituted among Men,",
	"deriving their just powers from the consent of the governed.",
	"That whenever any Form of Government becomes destructive of these ends,",
	"it is the Right of the People to alter or to abolish it, and to institute new Government,",
	"having its foundation on such principles and organizing its powers in such form,",
	"as to them shall seem most likely to effect their Safety and Happiness.",
	]
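class bichar():
	# counts, for each character, which characters follow it in the text
	def __init__(self, text1):
		self.text1 = text1
		self.gramDict1 = {}   # ch0 -> {ch1: number of times ch1 follows ch0}
		self.countDict1 = {}  # ch0 -> number of bi-chars that start with ch0
		count0 = 0            # total number of bi-chars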
		for pos1,ch1 in enumerate(self.text1):
			if pos1 != 0:
				count0 += 1
				if ch0 in self.gramDict1:
					gramDict2 = self.gramDict1[ch0]
					count1 = self.countDict1[ch0]
				else:
					gramDict2 = {}
					self.gramDict1[ch0] = gramDict2
					count1 = 0
				self.countDict1[ch0] = count1+1
				if ch1 in gramDict2:
					count1 = gramDict2[ch1]
				else:
					count1 = 0
				gramDict2[ch1] = count1+1
			ch0 = ch1
		self.gramCount1 = count0
	def show(self):
		# print the bi-char dictionary
		gramList1 = list(self.gramDict1.keys())
		gramList1.sort()
		print("[{0:,}] : bi-chars".format(self.gramCount1))
		for ch0 in gramList1:
			gramDict2 = self.gramDict1[ch0]
			count1 = self.countDict1[ch0]
			s1 = str(gramDict2)
			print("[{0:d}] \"{1:s}\" : {2:s}".format(count1,ch0,s1))
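	def init(self, seed1):
		# seed the random number generator and pick a starting character,
		# repeating the uni-char choice until a letter comes up
		self.seed1 = seed1
		self.rand1 = random.Random(self.seed1)
		while True:
			ch1 = self.base()
			if ch1.isalpha():
				self.ch0 = ch1
				return ch1
	def base(self):
		# uni-char choice: pick a character by how often it starts a bi-char
		ch1 = self.get(self.gramDict1,self.countDict1,self.gramCount1)
		return ch1
	def step(self):
		# bi-char choice: pick a character by how often it follows self.ch0
		gramDict2 = self.gramDict1[self.ch0]
		total1 = self.countDict1[self.ch0]
		ch2 = self.get(gramDict2,gramDict2,total1)
		return ch2
	def get(self, charDict1, countDict1, total1):
		# pick a character with probability count / total1 by walking the
		# cumulative counts until they reach a random number in [0, 1)
		r1 = self.rand1.random()
		sum1 = 0
		for ch1 in charDict1:
			x0 = countDict1[ch1]
			sum1 += x0
			pct1 = float(sum1) / float(total1)
			if pct1 >= r1:
				self.ch0 = ch1  # remember the choice as the current character
				return ch1
		return ch1  # not normally reached: the cumulative fraction ends at 1.0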
textText1 = " ".join(textList1)
textText1 = textText1.lower()
bichar1 = bichar(textText1)
print("Show:")
print("")
bichar1.show()
for pass1 in range(1,2+1):
	if pass1 == 1:
		way1 = "uni-chars"
	else:
		way1 = "bi-chars"
	print("")
	print("Generate using {0:s}:".format(way1))
	print("")
	for pos1 in range(0,9+1):
		seed1 = pos1
		charList2 = []
		ch2 = bichar1.init(seed1)
		pos2 = 0
		wordLen2 = 0
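		while ch2 != ".":
			if not ch2.isalpha():
				wordLen2 = 0
			else:
				wordLen2 += 1
				if wordLen2 > 9:
					# break up runs of more than 9 letters with a space
					charList2.append(" ")
					wordLen2 = 0
			charList2.append(ch2)
			if pass1 == 1:
				ch2 = bichar1.base()
			else:
				ch2 = bichar1.step()
			pos2 += 1
			if pos2 > 80:
				# past 80 characters, end with a period as soon as the next character is a letter
				if ch2.isalpha():
					ch2 = "."
					break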
		charList2.append(ch2)
		charText2 = "".join(charList2)
		print("{0:d}. {1:s}".format(pos1,charText2))

Here is the output of the Python code.

Show:
[652] : bi-chars
[109] " " : {'h': 4, 't': 27, 'b': 3, 's': 7, 'a': 15, 'm': 3, 'c': 4, 'e': 4, 'w': 2, 'u': 1, 'r': 3, 'l': 3, 'p': 5, 'o': 8, 'g': 4, 'i': 8, 'd': 2, 'j': 1, 'f': 4, 'n': 1}
[11] "," : {' ': 11}
[1] "-" : {'e': 1}
[2] "." : {' ': 2}
[33] "a" : {'t': 8, 'l': 5, 'r': 4, 'i': 1, 'b': 2, 'm': 2, 'n': 6, 'p': 2, 'v': 1, 's': 1, 'f': 1}
[6] "b" : {'e': 3, 'y': 1, 'l': 1, 'o': 1}
[11] "c" : {'r': 2, 'e': 1, 'u': 1, 'o': 2, 't': 2, 'h': 2, 'i': 1}
[15] "d" : {' ': 8, 'e': 3, 'o': 1, '.': 1, 's': 1, 'a': 1}
[77] "e" : {' ': 20, 's': 9, 'l': 2, 'v': 2, 'n': 11, 'a': 2, 'd': 4, 'q': 1, 'y': 1, 'i': 3, 'r': 11, ',': 1, 'c': 3, 'o': 1, 'w': 1, 'm': 2, 'e': 1, 'f': 1, 't': 1}
[14] "f" : {'-': 1, 'e': 3, ' ': 5, 'r': 1, 'o': 3, 'f': 1}
[13] "g" : {'h': 3, ' ': 5, 'o': 4, 'a': 1}
[33] "h" : {'o': 1, 'e': 15, 's': 1, 'a': 9, ' ': 4, 't': 3}
[36] "i" : {'d': 1, 'r': 3, 't': 8, 'n': 10, 'e': 1, 'g': 3, 'f': 1, 'b': 1, 'v': 2, 's': 2, 'o': 1, 'p': 1, 'z': 1, 'k': 1}
[1] "j" : {'u': 1}
[1] "k" : {'e': 1}
[17] "l" : {'d': 1, 'f': 1, 'l': 2, ' ': 2, ',': 1, 'i': 5, 'e': 3, 't': 1, 'y': 1}
[14] "m" : {'e': 6, 'o': 3, ' ': 4, ',': 1}
[39] "n" : {'t': 5, ' ': 5, 'd': 7, 'a': 2, 'g': 5, 'e': 5, 'm': 3, 's': 3, ',': 1, 'y': 1, 'c': 1, 'i': 1}
[36] "o" : {'l': 2, ' ': 7, 'w': 3, 'r': 5, 'n': 5, 'f': 5, 'v': 4, 'm': 2, 'p': 1, 'u': 1, 's': 1}
[11] "p" : {'u': 1, 'p': 2, 'i': 2, 'o': 2, 'e': 1, 'l': 2, 'r': 1}
[1] "q" : {'u': 1}
[34] "r" : {'u': 2, 'e': 7, ' ': 7, 't': 2, 'i': 5, 's': 3, 'n': 4, 'o': 1, 'm': 2, 'g': 1}
[36] "s" : {'e': 8, ' ': 10, ',': 3, 'u': 3, 's': 2, '.': 2, 't': 5, 'h': 2, 'a': 1}
[65] "t" : {'h': 21, 'r': 2, 'o': 8, ',': 3, ' ': 13, 'e': 4, 'a': 1, 's': 5, 'y': 2, 'i': 4, 'u': 2}
[13] "u" : {'t': 3, 'a': 1, 'n': 2, 'r': 2, 'i': 1, 's': 1, 'c': 3}
[9] "v" : {'i': 3, 'e': 6}
[7] "w" : {'e': 4, 'i': 1, 'h': 1, ' ': 1}
[6] "y" : {' ': 5, ',': 1}
[1] "z" : {'i': 1}
Generate using uni-chars:
0. il tlnhtsmt ir mpnahiabt edrmgta ntevonuwt a ha s gndehtces ftntgssdso s   rrteiam,am.
1. i tdrneendi wdv qaeetco le dt   t ensr .
2. gyeeniuhr rs dov.
3. tosreen  jtntr ratiueish eatvavmond cae  gdrhtooss abcapu agasv ns e,penl  iaerevhap.
4. eo eomni t  e mnnn hri,aerut tecathmsa,tlsd hneer tthz l r oihsaee ir h tefayige .
5. rincimetc raet tse  mi n rewa  pa gtu cfgaho  ehswbhhnt htfegei,enoswe y icchotiei n.
6. nnt wutioi niltb sn nb,hennde r ys rocmtrfnp eoshseateu h t  afeieharh ooekeliop pau.
7. h retoetede elne rysopea  ehn srotee blhsthnf stai peli teuisahfs st,ctuefr.
8. gefe k rttt ne e laoe perouhftr as eqt qsia oa ieffy,t  tteatf  eh lona o ug iheih .
9. to awtaesreo fti  etmsioie uile   n aaeotelae sesa weteetmlo res rgesr  .
Generate using bi-chars:
0. ind and ecowhaltu cherof ton, applidele y appisen fores, t poveedeng cuthamo om gop.
1. it, rnth antofem hony tollin theserigh tsesssheri t is, her t pes ang athaf at, ons .
2. gath pesuritor sas, thor apon he afrechapl esasers mopoveche n, atrseceow poneallip o .
3. tone h thtoved ucofrerit h anssher dse tines, suroffoss e, ine g bergh athapurus ik.
4. ellf ive me thtstsesu sty ey to d.
5. rstifolip ecr t creste o we ithts f llt fowe thed l blit, tuterucon themerm, s, ter.
6. ns thars owharigha nd.
7. hequr t, athere w cose, thereamen theindendo feais.
8. ghaththar atse hevelfeth alinenecth emo justhecrs eigangheng h an thes ansellike r hal.
9. to id irie anan thshando blevither echelid it t ithenthsh elirit f-ed tucrshed ssuc.
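
The bi-char generation step picks the next character with probability proportional to how often it followed the current character in the text. The sketch below shows the same idea in compact form, using collections.Counter and random.choices from the standard library in place of the explicit cumulative-sum loop in get(). The names buildTable1 and nextChar1 are illustrative, the sample text is a single line, and because the random numbers are consumed differently the generated text will not match the output above.

```python
import random
from collections import Counter, defaultdict

def buildTable1(text1):
    # Map each character to a Counter of the characters that follow it.
    table1 = defaultdict(Counter)
    for ch0, ch1 in zip(text1, text1[1:]):
        table1[ch0][ch1] += 1
    return table1

def nextChar1(table1, ch0, rand1):
    # Choose a follower of ch0 with probability proportional to its count.
    followers1 = table1[ch0]
    chars1 = list(followers1.keys())
    weights1 = list(followers1.values())
    return rand1.choices(chars1, weights=weights1, k=1)[0]

text1 = "that they are endowed by their creator with certain unalienable rights."
table1 = buildTable1(text1)
rand1 = random.Random(0)

ch1 = "t"
charList1 = [ch1]
for pos1 in range(40):
    if ch1 not in table1:
        break  # no recorded follower (here only the final period)
    ch1 = nextChar1(table1, ch1, rand1)
    charList1.append(ch1)

print("".join(charList1))
```

Every adjacent pair of characters in the generated string is a bi-char that occurs in the sample text, which is why the bi-char output looks more word-like than the uni-char output.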