Send Close Add comments: (status displays here)
Got it!  This site "www.robinsnyder.com" uses cookies. You consent to this by clicking on "Got it!" or by continuing to use this website.  Note: This appears on each machine/browser from which this site is accessed.
Longest common subsequence
by RS  admin@robinsnyder.com : 1024 x 640


1. Longest common subsequence
These notes were originally presented at the 2012 Lua Conference as Incrementally developing and implementing Hirschberg's longest common subsequence algorithm using Lua.

They have been revised, updated, and changed for this page.

2. Longest common subsequence
nematode-knowledge empty-bottle

LCS example
emt-ole

The LCS = Longest Common Subsequence problem is a dual problem of the SED = Shortest Edit Distance problem.

The solution to these problems are used in open source file comparison tools such as WinMerge and DiffMerge.

In 1974, Hirschberg published a reasonably space and time efficient solution to these problems.

3. Subsequence
Technical definition of a subsequence.

String C = c1c 2...c p is a subsequence of string A = a1a 2...a m if there as a mapping such that F(i) = k only if c i is a k and F is a monotone strictly increasing function (that is, (F(i) = u) and (F(j) = v) and (i < j) imply that (u < v)).

4. Common subsequence
String C is a common subsequence of strings A and B iff

5. Problem
Given strings find string such that C is a common subsequence of both A and B and p is maximized. C is then called a maximal common subsequence or Longest Common Subsequence.

6. Alphabet
Alphabets examples:

7. Example strings
Text stringsExample strings:

8. LCS
Text strings with connections

9. Symbols
Symbols can be anything that can be matched. For example purposes, letters will be used.

10. DNA
DNA basesDNA consists of a sequence of four different nucleotides.

11. DNA
DNA two chains
AGGCTATCACCTGACCTCCAGGCCGATGCCC... TAGCTATCACGACCGCGGTCGATTTGCCCGAC...

The DNA sequence is about 3 billion nucleotides (6 billion bits) for a human.

12. DNA code
DNA codeThree nucleotides go together to specify one amino acid (or start or stop instructions).

The 4*4*4 = 64 triples specify 64 codes.

13. WinMerge
WinMergeFile comparison: (line oriented, useful for regression testing, etc.): Useful WinMerge setting:

14. DiffMerge
DiffMergeFile comparison: (line oriented, useful for regression testing, etc.):

15. Line comparison
Winmerge line comparisonMake each letter a line in a file.

Note: LCS can be used on individual lines to see similarities and differences within a line.

Lines can be made to be similar using regular expression pattern matching.

16. Change management
The SED = Shortest Edit Distance is a dual problem of the LCS = Longest Common Subsequence problem.
nematode-knowledge empty-bottle

LCS example
emt-ole

This algorithm is the basis of change management and regression testing systems.

17. Example in WinMerge
LCS example in WinMergeSome analysis/compare programs.
nematode-knowledge empty-bottle


18. Approach
Approach:

19. Program and output

a = "empty_bottle" b = "nematode_knowledge" print("a=[" .. a .. "]") print("b=[" .. b .. "]") local c = top_down_lcs1(a, b) print(" c=[" .. c .. "]")


20. Output:

a=[empty_bottle] b=[nematode_knowledge] c=[emt_ole]

Time and space efficiency depends on the algorithm used.

21. Possible matches

22. Start
LCSStart from the end of both strings.

23. Compare
Compare both versions for symmetry: 2*2*2 = 8 approaches. All yield the same LCS.

24. Match
LCS progression

25. Non-Match (1)
LCS progression

26. Non-Match (2)
LCS progression

27. Next
LCS progression

28. Recursive top down backward

a 1 a 2 ... a m-1 a m b 1 b 2 ... b n-1 x n


function lcs_1b(a, b) local m = #a local n = #b if (m == 0) or (n == 0) then return "" elseif string.sub(a, m, m) == string.sub(b, n, n) then return lcs_1b(string.sub(a, 1, m-1), string.sub(b, 1, n-1)) .. string.sub(a, m, m) else local a1 = lcs_1b(a, string.sub(b, 1, n-1)) local b1 = lcs_1b(string.sub(a, 1, m-1), b) return math.max(#a1, #b1) end end

Time and space INEFFICIENT!!!

29. Recursive top down forward

a 1 a 2 ... a m-1 a m b 1 b 2 ... b n-1 x n


function lcs_1f(a, b) local m = #a local n = #b if (m == 0) or (n == 0) then return "" elseif string.sub(a, 1, 1) == string.sub(b, 1, 1) then return string.sub(a, 1, 1) .. lcs_1f(string.sub(a, 2, m), string.sub(b, 2, n)) else local a1 = lcs_1f(a, string.sub(b, 2, n)) local b1 = lcs_1f(string.sub(a, 2, m), b) return math.max(#a1, #b1) end end

Time and space INEFFICIENT!!!

30. Maximum subsequence length

function lcs_2b(a, b) local m = #a local n = #b if (m == 0) or (n == 0) then return 0 elseif string.sub(a, m, m) == string.sub(b, n, n) then return lcs_2b(string.sub(a, 1, m-1), string.sub(b, 1, n-1)) + 1 else local a1 = lcs_2b(a, string.sub(b, 1, n-1)) local b1 = lcs_2b(string.sub(a, 1, m-1), b) return math.max(a1, b1) end end


31. Output

a=[empty_bottle] b=[nematode_knowledge] c=[7]


32. Next step

33. Use a list for A and B
Use a list for A and B.
A = {} setDefault(A, "") for i=1,string.len(a) do A[i] = string.sub(a, i, i) end B = {} setDefault(B, "") for j=1,string.len(b) do B[j] = string.sub(b, j, j) end io.write("A=[") for i,a in pairs(A) do io.write(a) end print("]") io.write("B=[") for j,b in pairs(B) do io.write(b) end print("]")


34. Modified code

function lcs_3b(A, i, B, j) if (i == 0) or (j == 0) then return 0 elseif A[i] == B[j] then return lcs_3b(A, i-1, B, j-1) + 1 else local a1 = lcs_3b(A, i, B, j-1) local b1 = lcs_3b(A, i-1, B, j) return math.max(a1, b1) end end


35. Call

c = lcs_3b(A, #A, B, #B) print("c=[" .. c .. "]")


36. Observation
Observation: L(i, j) is a maximal possible length common subsequence of A 1i and B 1j.

Initialization of L, the Length matrix.
L = {} for i=1,#A do L[i] = {} for j=1,#B do L[i][j] = -1 end end

For convenience, L is initially defined as -1 everywhere (explicitly or via default metatable method).

37. Initial L matrix

L= n e m a t o d e _ k n o w l e d g e e -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 m -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 p -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 t -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 y -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 _ -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 b -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 o -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 t -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 t -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 l -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 e -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1


38. Initial L matrix
LCS

39. Compute the L matrix

function lcs_4b(A, i, B, j, L) local p if (i == 0) or (j == 0) then p = 0 else if A[i] == B[j] then p = lcs_4b(A, i-1, B, j-1, L) + 1 else local a1 = lcs_4b(A, i, B, j-1, L) local b1 = lcs_4b(A, i-1, B, j, L) p = math.max(a1, b1) end L[i][j] = p end return p end

The L matrix is computed.

40. Computed L matrix

L= n e m a t o d e _ k n o w l e d g e e 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 m 0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 -1 p 0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 -1 t 0 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 -1 y 0 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 -1 _ 0 1 2 2 3 3 3 3 4 4 4 4 4 4 4 4 4 -1 b 0 1 2 2 3 3 3 3 4 4 4 4 4 4 4 4 4 -1 o 0 1 2 2 -1 4 4 4 4 4 4 5 5 5 5 5 5 -1 t 0 1 2 2 3 4 4 4 4 4 4 5 5 5 5 5 5 -1 t -1 -1 -1 -1 3 4 4 4 4 4 4 5 5 5 5 5 5 -1 l -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 6 6 6 6 -1 e -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 7


41. Computed L matrix
LCS

42. Recover the LCS: approach
LCS

43. Recover the LCS: code
To recover the LCS from L, backtrack through the matrix.
function path_extract1(L, A, i, B, j) if (i == 0) or (j == 0) then return "" elseif A[i] == B[j] then return path_extract1(L, A, i-1, B, j-1) .. A[i] else local x1, x2 if j == 1 then x1 = -1 else x1 = L[i][j-1] end if i == 1 then x2 = -1 else x2 = L[i-1][j] end if x1 > x2 then return path_extract1(L, A, i, B, j-1) else return path_extract1(L, A, i-1, B, j) end end end


44. Call the extraction
Call as follows.
p = lcs_6b(A, 1, #A, B, 1, #B, L) print("p=[" .. p .. "]") c = path_extract1(L, A, #A, B, #B) print("c=[" .. c .. "]")

This is time efficient but space inefficient!

45. Efficiency
The recursive solution is very inefficient.

Solution: Memoization.
function lcs_5b(A, i, B, j, L) local p if (i == 0) or (j == 0) then p = 0 else p = L[i][j] if p < 0 then if A[i] == B[j] then p = lcs_5b(A, i-1, B, j-1, L) + 1 else local a1 = lcs_5b(A, i, B, j-1, L) local b1 = lcs_5b(A, i-1, B, j, L) p = math.max(a1, b1) end L[i][j] = p end end return p end

The same L matrix is computed.

46. Add the start and stop indexes

function lcs_6b(A, i1, i2, B, j1, j2, L) local p2 if (i2 < i1) or (j2 < j1) then p = 0 else p = L[i2][j2] if p < 0 then if A[i2] == B[j2] then p = lcs_6b(A, i1, i2-1, B, j1, j2-1, L) + 1 else local a1 = lcs_6b(A, i1, i2, B, j1, j2-1, L) local b1 = lcs_6b(A, i1, i2-1, B, j1, j2, L) p = math.max(a1, b1) end L[i2][j2] = p end end return p end

The same L matrix is computed.

Note: Recursion can be converted to iteration. Left as an exercise.

47. Source
Hirshberg page 1Accessible 3-page paper with which to get started.

48. Source
Hirshberg page 2(page 2/3)

49. Source
Hirshberg page 3(page 3/3)

50. Source

51. Hirschberg Approach
LCS

52. Hirschberg Approach
LCS

53. Hirschberg Approach
LCS

54. Hirschberg Approach
LCS

55. Hirschberg Approach
LCS

56. Animation
LCS

57. Hirschberg (full L, rows)

function dpa_traverse_4(L, A, B, i1, i3, j1, j3, dx) for i=i1,i3,dx do for j=j1,j3,dx do if A[i] == B[j] then if (i == i1) or (j == j1) then L[i][j] = 1 else L[i][j] = 1 + L[i-dx][j-dx] end else local y1, y2 if i == i1 then y1 = 0 else y1 = L[i-dx][j] end if j == j1 then y2 = 0 else y2 = L[i][j-dx] end L[i][j] = math.max(y1, y2) end end end end


58. Hirschberg (full L, main)

function lcs_hirschberg_4(L, A, B, i1, i3, j1, j3, level) if not L[level] then L[level] = initL1(A, B) hlist1[level] = {} end if j1 > j3 then for i=i1,i3 do extractPut1(A[i]," ",1) end elseif i1 == i3 then table.insert(hlist1[level], 4 .. " " .. #ts1 - i1+ 1 .. " " .. #ts1 - i3 .. " " .. j1 .. " " .. j3 .. " 1.5 (" .. "990099" .. ") doBox4") local j2 = 0 for j=j3,j1,-1 do if (A[i1] == B[j]) and (j2 == 0) then j2 = j extractPut1(A[i1],B[j],1) else extractPut1(" ",B[j],1) end end if j2 == 0 then extractPut1(A[i1]," ",1) else table.insert(hlist1[level], 5 .. " " .. #ts1 - i1+ 1 .. " " .. #ts1 - i3 .. " " .. j2 .. " " .. j2 .. " 0.5 (" .. "00CC00" .. ") doBox4") table.insert(hlist1[level], 0 .. " " .. j2 .. " " .. #ts1 - i1 .. " " .. string.byte(A[i1]) .. " " .. "(CCFFCC) putChar2") end else local i2 = math.floor((i1+i3)/2) dpa_traverse_4(L[level], A, B, i1, i2, j1, j3, 1) dpa_traverse_4(L[level], A, B, i3, i2+1, j3, j1, -1) local j2 = j1-1 local k1 = 0 for j=j1,j3 do local k k = L[level][i2][j] + L[level][i2+1][j] if k > k1 then k1 = k j2 = j end end table.insert(hlist1[level], 1 .. " " .. #ts1 - i1+ 1 .. " " .. #ts1 - i2 .. " " .. j1 .. " " .. j2 .. " 1.0 (0000CC) doBox4") table.insert(hlist1[level], 1 .. " " .. #ts1 - i2 .. " " .. #ts1 - i3 .. " " .. j2+1 .. " " .. j3 .. " 1.0 (0000CC) doBox4") table.insert(hlist1[level], 2 .. " " .. #ts1 - i1+ 1 .. " " .. #ts1 - i3 .. " " .. j1 .. " " .. j3 .. " 1.0 (000000) doBox4") table.insert(hlist1[level], 3 .. " " .. #ts1 - i2 - 1 .. " " .. #ts1 - i2 + 1 .. " " .. j1 .. " " .. j3 .. " 1.0 (CC0000) doBox4") lcs_hirschberg_4(L, A, B, i1, i2, j1, j2, level+1) lcs_hirschberg_4(L, A, B, i2+1, i3, j2+1, j3, level+1) end end


59. Change management


Information sign More: WinMerge: Comparing files
Information sign More: LCS in Python

60. End of page

by RS  admin@robinsnyder.com : 1024 x 640