LCS in Python
by RS  admin@robinsnyder.com : 1024 x 640


1. Python
Programs such as WinMerge and DiffMerge are useful for manually comparing files (and folders).

In many cases, automated methods are necessary. Python provides some ways to automate the comparison of files, lines, etc.

2. Hashes for file comparison
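One way to do an all-or-nothing comparison is with hashes: if two files have different hashes they differ, and if they have the same hash they are, for practical purposes, identical. Hashes can be stored, transferred, and compared remotely.

Hashes are used in security (where, for password storage for example, the hash algorithm should be very discriminating and slow) and in file comparisons (where the hash algorithm should be discriminating and fast). Here is the Python code [#1]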

Here is the output of the Python code.
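As a minimal sketch of the idea (an illustration, not the listing referenced above), two strings can be compared by hashing each one and comparing the digests. The helper name `text_hash` and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib

def text_hash(text):
    """Return the SHA-256 hex digest of a string (encoded as UTF-8)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = "The quick brown fox"
b = "The quick brown fox"
c = "The quick brown fix"

print(text_hash(a) == text_hash(b))  # identical text, identical digest
print(text_hash(a) == text_hash(c))  # one character differs, digests differ
```

The digest has a fixed length no matter how long the input is, which is what makes it convenient to store or send elsewhere.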


3. Remarks
The example uses only small strings, which would be easy to compare directly as text.

What happens when the text is the text of a huge file?

And what if the files are on different computers (e.g., in a web update)? Only the hashes need to be transferred and managed remotely, not the entire files. The same holds for cluster computing, where files may be on different computers.
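For large files, one possible approach is to feed the file to the hash in chunks so the whole file never has to be held in memory. The sketch below assumes SHA-256 and a 64 KB chunk size; the names `file_hash` and `same_content` are hypothetical helpers, and `remote_digest` stands for a hex digest received from another machine.

```python
import hashlib

def file_hash(path, chunk_size=65536):
    """Hash a file in fixed-size chunks so it never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_content(local_path, remote_digest):
    """Compare a local file against a digest computed elsewhere."""
    return file_hash(local_path) == remote_digest
```

Only the short digest string crosses the network; each machine hashes its own copy of the file.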

4. Similarities and differences
In many cases, a yes-or-no answer is not sufficient. One often wants to know how two files are similar and where they differ.

One example is the percent match of one file with another.
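One way to get such a percentage, sketched here with the standard `difflib` module, is `SequenceMatcher.ratio()`, which returns 2*M/T where M is the number of matched items and T is the total number of items in both sequences. The helper name `percent_match` and the sample lines are made up for illustration.

```python
import difflib

def percent_match(lines_a, lines_b):
    """Return the similarity of two lists of lines as a percentage (0 to 100)."""
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    return 100.0 * matcher.ratio()

old = ["alpha", "beta", "gamma", "delta"]
new = ["alpha", "beta", "epsilon", "delta"]

# Three of the four lines match on each side: 2 * 3 / 8 = 0.75
print(f"{percent_match(old, new):.1f}%")  # 75.0%
```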

5. Example
Consider 10,000 files to be compared with 10,000 other files, each with about 1,000 lines. That is 100 million file pairs. Since each pairwise comparison can be assumed independent of the others, cluster computing starts to look promising.

A start is to compare one file with another file for similarities and differences.
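Because the pairs are independent, the work can be expressed as a map over all pairs. The sketch below is one possible shape for that, not a cluster implementation: `compare_pair` and `compare_all` are hypothetical names, the inputs are assumed to be dictionaries mapping file names to lists of lines, and the `mapper` argument is an assumption that lets the built-in `map` be swapped for a parallel one.

```python
import difflib
from itertools import product

def compare_pair(pair):
    """Return (name_a, name_b, similarity ratio) for one pair of files."""
    (name_a, lines_a), (name_b, lines_b) = pair
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    return name_a, name_b, matcher.ratio()

def compare_all(files_a, files_b, mapper=map):
    """Compare every file in files_a with every file in files_b.

    No pair depends on any other, so mapper can be replaced by a parallel
    map (for example, the map method of a multiprocessing.Pool).
    """
    pairs = product(files_a.items(), files_b.items())
    return list(mapper(compare_pair, pairs))

files_a = {"a1": ["x", "y"], "a2": ["x", "z"]}
files_b = {"b1": ["x", "y"]}
for name_a, name_b, ratio in compare_all(files_a, files_b):
    print(name_a, name_b, ratio)
```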

6. Difflib
The difflib module, part of the Python standard library, can be used for various aspects of the LCS (Longest Common Subsequence) problem.

7. Example usage
Here is the Python code [#2]

Here is the output of the Python code.
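As a minimal sketch of typical usage (an illustration, not the listing referenced above), `difflib.SequenceMatcher` can report both the edit operations between two sequences and the common subsequence it found. The sample lists are made up for this example.

```python
import difflib

old = ["one", "two", "three", "four"]
new = ["one", "2", "three", "four", "five"]

matcher = difflib.SequenceMatcher(None, old, new)

# Each opcode is (tag, i1, i2, j1, j2): old[i1:i2] relates to new[j1:j2],
# and tag is one of "equal", "replace", "delete", or "insert".
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f"{tag:8} old[{i1}:{i2}]={old[i1:i2]} new[{j1}:{j2}]={new[j1:j2]}")

# The matching blocks together form the common subsequence that was found.
common = [old[m.a + k] for m in matcher.get_matching_blocks() for k in range(m.size)]
print(common)  # ['one', 'three', 'four']
```

Note that the difflib documentation states that `SequenceMatcher` does not guarantee a minimal edit sequence, so the common subsequence it reports is not always the longest possible one.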


8. Limitations/features
Note that in some cases, the "replace" feature ignores certain parts of a subsequence. I have not been able to determine how to have that part recognized.
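One possible reading of this, offered as an assumption since the exact case is not shown, is that a line-level comparison marks a whole line as "replace" even when most of its characters are unchanged. A sketch of one workaround is to run a second, character-level `SequenceMatcher` over each replaced span. The function name `refine_replace` is hypothetical.

```python
import difflib

def refine_replace(old, new):
    """Re-compare each line-level "replace" span character by character.

    Returns a list of (tag, old_fragment, new_fragment) tuples.
    """
    results = []
    line_matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in line_matcher.get_opcodes():
        if tag != "replace":
            continue
        # Treat the replaced lines as plain text for the second pass.
        old_text = "\n".join(old[i1:i2])
        new_text = "\n".join(new[j1:j2])
        char_matcher = difflib.SequenceMatcher(None, old_text, new_text, autojunk=False)
        for ctag, a1, a2, b1, b2 in char_matcher.get_opcodes():
            results.append((ctag, old_text[a1:a2], new_text[b1:b2]))
    return results

old = ["alpha", "beta", "gamma"]
new = ["alpha", "betta", "gamma"]
for item in refine_replace(old, new):
    print(item)
```

Separately, `SequenceMatcher` has an `autojunk` heuristic that is on by default: for sequences of at least 200 items, very frequent items are treated as junk and are not used to find matches. Passing `autojunk=False`, as above, turns it off, which may be worth trying when expected matches are missing.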
