by Lee Spector (lspector)
Overlap is a new measure of the similarity of possibly nested data structures.
Footnote: I think it’s new; let me know if you think otherwise! Also let me know if you think this should have a different name; I had a half dozen candidates and “overlap” was my favorite of those, but there may well be a better one.
Overlap can be applied to many kinds of data including Push code and symbolic expressions of the kind used in Koza-style genetic programming systems. Unlike many of the code similarity/difference measures described in the genetic programming literature overlap is not directly tied to any specific syntax or semantics or to any specific notion of “editing steps.” It deals with changes to list lengths, nesting level, and ordering in ways that reflect both what changes and what does not change in such transformations. That said, no similarity measure will be perfect for all applications, and I make no such claims about overlap. But overlap does have the following nice features:
- Broadly applicable.
- Handles arbitrary transformations in reasonable ways.
- Easy to describe and to compute.
- Ranges from 0 (for entirely distinct inputs) to 1 (for identical inputs)
Overlap is defined in terms of the collections of the items contained in each of the arguments, including nested items at all levels. For example, if x is (a (b c)) then Items(x) will contain the following five items: a, b, c, (b c) and (a (b c)). Within this context we define:
Overlap(x, y) = the maximal number of simultaneous, disjoint pairings by identity of items in Items(x) with items in Items(y), divided by the size of the larger of Items(x) and Items(y).
Following are some examples, produced by the new overlap-demo function in Clojush (see below):
a , () --- float overlap: 0.0 a , a --- float overlap: 1.0 a , b --- float overlap: 0.0 a , (a b) --- float overlap: 0.33333334 a , (a a) --- float overlap: 0.33333334 (a) , (a b) --- float overlap: 0.33333334 (a b) , (a c) --- float overlap: 0.33333334 (a b) , (a b c) --- float overlap: 0.5 (a b c) , (a b d) --- float overlap: 0.5 (a b c d) , (a b c e) --- float overlap: 0.6 (a b c d) , (d c b a) --- float overlap: 0.8 (a b c d e f g) , (a b c d e f h) --- float overlap: 0.75 (a b) , (a b c d e f) --- float overlap: 0.2857143 (a (b (c))) , (a (b (c))) --- float overlap: 1.0 (a (b (c))) , (a (b (x))) --- float overlap: 0.33333334 (a (b (c))) , (a (x (c))) --- float overlap: 0.5 (a (b (c))) , (x (b (c))) --- float overlap: 0.6666667 (a (b c) (d (e f))) , ((d (e f)) a) --- float overlap: 0.6 (a (b c) (d (e f))) , (a a (b c) (d (e f))) --- float overlap: 0.8181818
I developed overlap for comparisons of Push programs (for various purposes), but I could imagine it having other uses as well. Might it be useful for analysis or diversity maintenance in Koza-style genetic programming? I don’t know, but if anyone wants to explore that I’d love to hear about it. It may be that the syntax restrictions of Koza-style symbolic expressions make the flexibility of overlap unnecessary, and maybe overlap reduces to some other measure in such circumstances… but I don’t know.
Push users should also note that overlap differs in several ways from the “discrepancy” measure that has long been included with Push implementations (e.g. in the code_discrepancy instruction). Aside from the difference in direction (higher overlap means more similar, while higher discrepancy means less similar) a major difference is that overlap is normalized to [0, 1] while discrepancy ranges from 0 up to arbitrarily large integers. Analysis of other differences is left as an exercise to the reader.
An overlap utility function, an overlap-demo function, and a code_overlap Push instruction have been added to the latest commit of Clojush; details below.
Clojush is available as usual from: https://github.com/lspector/Clojush
Here is the latest commit message:
20110607: - Added overlap utility function, overlap-demo (which just prints some examples to show how overlap works), and code_overlap instruction. Overlap can be applied to any two (possibly nested) things and it returns a number between 0 (meaning no overlap) and 1 (meaning exact match). The overlap utility function returns a ratio, but the code_overlap instruction pushes a float. - Removed complex number support from 20110505. There were previous reports of problems and I've noticed problems from the fact that (apply + ()) => zero (as opposed to 0) in the clojush namespace defined by the code as revised for complex number support. If someone knows how to re-introduce complex number support without such problems then please let me know.