Overlap

June 7, 2011
by Lee Spector (lspector)

Overlap is a new measure of the similarity of possibly nested data structures.

Footnote: I think it’s new; let me know if you think otherwise! Also let me know if you think this should have a different name; I had a half dozen candidates and “overlap” was my favorite of those, but there may well be a better one.

Overlap can be applied to many kinds of data including Push code and symbolic expressions of the kind used in Koza-style genetic programming systems. Unlike many of the code similarity/difference measures described in the genetic programming literature overlap is not directly tied to any specific syntax or semantics or to any specific notion of “editing steps.” It deals with changes to list lengths, nesting level, and ordering in ways that reflect both what changes and what does not change in such transformations. That said, no similarity measure will be perfect for all applications, and I make no such claims about overlap. But overlap does have the following nice features:

  • Broadly applicable.
  • Handles arbitrary transformations in reasonable ways.
  • Easy to describe and to compute.
  • Ranges from 0 (for entirely distinct inputs) to 1 (for identical inputs)

Overlap is defined in terms of the collections of the items contained in each of the arguments, including nested items at all levels. For example, if x is (a (b c)) then Items(x) will contain the following five items: a, b, c, (b c) and (a (b c)). Within this context we define:

Overlap(x, y) = the maximal number of simultaneous, disjoint pairings by identity of items in Items(x) with items in Items(y), divided by the size of the larger of Items(x) and Items(y).

Following are some examples, produced by the new overlap-demo function in Clojush (see below):

a , () --- float overlap: 0.0 
a , a --- float overlap: 1.0 
a , b --- float overlap: 0.0 
a , (a b) --- float overlap: 0.33333334 
a , (a a) --- float overlap: 0.33333334 
(a) , (a b) --- float overlap: 0.33333334 
(a b) , (a c) --- float overlap: 0.33333334 
(a b) , (a b c) --- float overlap: 0.5 
(a b c) , (a b d) --- float overlap: 0.5 
(a b c d) , (a b c e) --- float overlap: 0.6 
(a b c d) , (d c b a) --- float overlap: 0.8 
(a b c d e f g) , (a b c d e f h) --- float overlap: 0.75 
(a b) , (a b c d e f) --- float overlap: 0.2857143 
(a (b (c))) , (a (b (c))) --- float overlap: 1.0 
(a (b (c))) , (a (b (x))) --- float overlap: 0.33333334 
(a (b (c))) , (a (x (c))) --- float overlap: 0.5 
(a (b (c))) , (x (b (c))) --- float overlap: 0.6666667 
(a (b c) (d (e f))) , ((d (e f)) a) --- float overlap: 0.6 
(a (b c) (d (e f))) , (a a (b c) (d (e f))) --- float overlap: 0.8181818 

I developed overlap for comparisons of Push programs (for various purposes), but I could imagine it having other uses as well. Might it be useful for analysis or diversity maintenance in Koza-style genetic programming? I don’t know, but if anyone wants to explore that I’d love to hear about it. It may be that the syntax restrictions of Koza-style symbolic expressions make the flexibility of overlap unnecessary, and maybe overlap reduces to some other measure in such circumstances… but I don’t know.

Push users should also note that overlap differs in several ways from the “discrepancy” measure that has long been included with Push implementations (e.g. in the code_discrepancy instruction). Aside from the difference in direction (higher overlap means more similar, while higher discrepancy means less similar) a major difference is that overlap is normalized to [0, 1] while discrepancy ranges from 0 up to arbitrarily large integers. Analysis of other differences is left as an exercise to the reader.

An overlap utility function, an overlap-demo function, and a code_overlap Push instruction have been added to the latest commit of Clojush; details below.

Clojush is available as usual from: https://github.com/lspector/Clojush

Here is the latest commit message:

20110607: - Added overlap utility function, overlap-demo (which just prints
            some examples to show how overlap works), and code_overlap
            instruction. Overlap can be applied to any two (possibly
            nested) things and it returns a number between 0 (meaning
            no overlap) and 1 (meaning exact match). The overlap utility
            function returns a ratio, but the code_overlap instruction
            pushes a float.
          - Removed complex number support from 20110505. There were previous
            reports of problems and I've noticed problems from the fact that
            (apply + ()) => zero (as opposed to 0) in the clojush namespace
            defined by the code as revised for complex number support. If
            someone knows how to re-introduce complex number support without
            such problems then please let me know.

Removed Clojush complex number support

June 7, 2011
by Lee Spector (lspector)

There were reports of other problems, and I just experienced a problem that I traced to the fact that (apply + ()) => zero (as opposed to 0) in the clojush namespace defined by the code as it was revised for complex number support. If someone knows how to re-introduce complex number support without such problems then please let me know.

Clojush is available as usual from: 
https://github.com/lspector/Clojush


Size limits on zippers in Clojush

May 26, 2011
by Lee Spector (lspector)

A small but important bug fix if you’re using zippers in Clojush.

Available as usual from: https://github.com/lspector/Clojush

20110526: - Enforce size limits on zipper manipulation results.

Keijzer-style error scaling in Clojush

May 17, 2011
by Lee Spector (lspector)

Added Clojush support for Keijzer-style error scaling (developed for symbolic regression applications, but maybe it’ll be useful elsewhere). Details below and in the source files.

Available as usual from: https://github.com/lspector/Clojush

20110517: - Added a "scaled-errors" function to support error-scaling as 
            described by Maarten Keijzer in Scaled Symbolic Regression, in
            Genetic Programming and Evolvable Machines 5(3), pp. 259-269, 
            September 2004. This must be used in a problem's error function,
            and then the outputs of the evolved program must be "unscaled."
            See the documentation string for scaled-errors and also
            examples/scaled_sextic.clj for details.
          - Added examples/scaled_sextic.clj to demonstrate the use of
            scaled-errors.
          - Changed examples/sextic.clj to use squared errors and an error
            threshold, in order to facilitate comparisons between the
            versions that do and don't use error scaling.
          - Made minor changes to the korns_regression_p12 example.

Complex numbers in Clojush (by Brian Martin)

May 8, 2011
by Lee Spector (lspector)

Brian Martin has added support for complex numbers in Clojush; details below. Anyone feel like contributing an example file that uses this?

Available as usual from: https://github.com/lspector/Clojush

20110505: - Added complex number support.  New instructions for the 'complex' 
            stack include: pop, dup, swap, rot, flush, eq, stackdepth, yank, 
            yankdup, shove, rand, add, sub, mult, divide, fromfloat, 
            frominteger, fromfloats, fromintegers, conjugate, magnitude, 
            and principal_sqrt. (Brian Martin)

Gaussian mutation in Clojush

April 24, 2011
by Lee Spector (lspector)

A couple of Clojush updates, mostly involving a new mutation operator that perturbs floating-point constants in a program with Gaussian noise. Details below.

Available as usual from: https://github.com/lspector/Clojush

220110424: - Added Gaussian mutation for floating point literals. This is 
            a genetic operator on par with ordinary mutation, crossover,
            and simplification, with the probability for applying this operator
            set with the gaussian-mutation-probability argument to pushgp
            (which defaults to zero). The behavior of this operator, when used,
            is controlled by two other arguments to pushgp, 
            gaussian-mutation-per-number-mutation-probability (which is the
            probability that any particular floating point number in the 
            program will actually be mutated -- this defaults to 0.5) and
            gaussian-mutation-standard-deviation (which specifies the standard
            deviation of the Gaussian noise generator that is used to 
            produce changes to numbers -- this defaults to 0.1).
          - Added examples/gaussian_mutation_demo.clj to demonstrate Gaussian
            mutation.
          - Added examples/korns_regression_p12.clj, a symbolic regression
            problem based on Michael Korns's draft chapter from GPTP11.

No-pop tagging in Clojush

April 9, 2011
by Lee Spector (lspector)

Added an option that changes the behavior of tagging instructions in Clojush, so that they tag without popping the tagged items off of the stacks from which they came. Details below.

Available as usual from: https://github.com/lspector/Clojush

20110409: - Added support for no-pop tagging through a var called
            global-pop-when-tagging (holding an atom with a boolean value)
            and a boolean argument to pushgp called pop-when-tagging. 
            The defaults are true, for backwards compatibility. If 
            @global-pop-when-tagging is false (which will result from 
            passing false as a :pop-when-tagging keyword argument to pushgp)
            then executing instructions of the form tag_[type]_[number]
            will tag a value as usual, but the tagged item will not be popped
            from its source stack.

Writing new Push instructions for Clojush

April 7, 2011
by Lee Spector (lspector)

A few key facts and tips for anyone who wants to write new Push instructions for use in Clojush:

  • All Push instructions in Clojush must be written to take one argument (a Push interpreter state) and return a Push interpreter state. When you think of the execution of a Push program you should generally think of the instructions as taking arguments from stacks and pushing results on stacks, but at the Clojure level the function that implements an instruction actually takes a complete Push interpreter state (containing all of the stacks with all of their contents) and returns a complete Push interpreter state (containing stacks that have possibly been modified by the execution of the instruction).
  • Push instructions in Clojush should be defined not with the normal Clojure function definition macro “defn” but rather with “define-registered”. This does the work of defn but also registers the instruction in the Push instruction table, so that the Push interpreter will know that it is a Push instruction and know how to call it properly.
  • To understand what a call to define-registered should look like it helps to first appreciate how to define Clojure functions in yet another way, with the “def” special form (which creates a var) and the “fn” special form (which creates a function). A function defined as (defn foo [x] (+ x 2)), which returns the value of its single argument plus 2, can also be defined as (def foo (fn [x] (+ x 2))). In this version the fn expression returns an anonymous function, which def then puts in a var called foo. This is just like defining a constant var, with an expression like (def bar 3), but with a function as the value. The defn version is a little neater, but defn is really just a macro that expands into the version with def and fn.
  • Calls to define-registered should look like calls to def with a function value, in particular with a function value that takes one argument (which will be a Push interpreter state) and which returns a possibly-modified Push interpreter state.
  • A Push interpreter state is a Clojure struct, which is really just a Clojure hash-map with some pre-defined keys. You can access the stacks using hash-map accessors to pull things out (for example, (:integer state) to get the integer stack) and assoc to return a state with a modified stack (for example, (assoc state :integer ()) to return state but with an empty integer stack), but it is probably simpler and clearer in most cases to use the utility function stack-ref to access stacks and the functions push-item and pop-item to return states with modified stacks.
  • One very simple instruction definition that is in a lot of examples is:
    (define-registered in
      (fn [state] (push-item (stack-ref :auxiliary 0 state) :integer state)))

    This defines and registers a function called “in” that takes one argument, a state, and returns the state that results from pushing something — specifically the thing that’s on the top of the auxiliary stack — onto the integer stack in the given state.

  • An example that demonstrates a bit more, and which is probably a good template for many more complex instructions, is float_sin:
    (define-registered float_sin
      (fn [state]
        (if (not (empty? (:float state)))
          (push-item (keep-number-reasonable (Math/sin (stack-ref :float 0 state)))
            :float
            (pop-item :float state))
          state)))

    This follows a generally useful pattern of first checking to see if the conditions for executing the instruction are met — in this case, that means checking to make sure that there’s at least one thing on the float stack (that is, that it’s not empty) — and then returning either the result of executing the instruction or the original state unchanged. In this particular case executing the instruction means replacing the top float with its sine, which we get by pushing the sine onto the float stack in the state that results from popping the float stack in the original state. Here we compute the sine by grabbing the top item on the float stack in the original state, calling Math/sin on it, and then running it through a utility function called keep-number-reasonable that prevents the generation of problematic numerical values (see the implementation in clojush.clj for details).

  • Sometimes the Clojure “threading” operators like ->> can be handy for running a state through a bunch of modifications. See the definition of integer_fromfloat in clojush.clj for a simple example. You never have to do this; it’s just a convenience if you like the threading style.
  • If you use the utility function top-item to access the top item of a stack, then note that it will return :no-stack-item if there was no top item to return. This will also be returned from stack-ref when it tries to access an empty stack, but note that stack-ref is not safe for invalid positions (like trying to get the second item on a stack with only one item), so you should check that there are enough arguments (e.g. with (count (:boolean state)) to see how many things are on the boolean stack) and avoid calling stack-ref if it might be looking too deep.
  • You will notice that many of the Push instructions defined in clojush.clj are actually defined with the help of a higher-order function that returns the necessary fn form when given additional arguments, e.g. for the type on which you want to operate. For example, look at the definitions of exec_dup and all of the other dup instructions. Rather than writing all of the code for checking if the necessary argument is there, and then duplicating it if it is there (but returning the state unchanged if it is not), over again for each type, I wrote a higher-order utility called “duper” that takes a Push type and returns a function with all of the right stuff for the appropriate stack. That means that I only had to write the guts of the instruction once, and that I could define dup for each new type with a simple new statement like (define-registered boolean_dup (duper :boolean)). This is one of the nice things about working in a Lisp-like language, but it’s just a convenience and you could define all of the Push instructions that you need without using this trick.

Clojush bug fix, mux example

March 27, 2011
by Lee Spector (lspector)

An important bug fix and a new example in Clojush.

Available as usual from: https://github.com/lspector/Clojush

20110322: - Tag reference bug fixed in closest-association (>= changed to <=).
          - Added mux (multiplexer) example (a couple of approaches in one file).

Monads and Push

March 19, 2011
by Lee Spector (lspector)

I just watched a tutorial on monads (http://www.vimeo.com/20717301), and I was thinking about it in the context of the fact that Push instructions in some versions of Push (e.g. clojush) are all implemented as functions that take and return complete Push interpreter states. Very monad-like, I think. Mulling this a bit more I think there’s probably a nice way to use related ideas to provide a more elegant and uniform approach to several issues that have come up recently, including:

  • Protected state, as raised by Geoffrey Romer on the Push list and also addressed by Maarten Keijzer here.
  • The various uses to which the “auxiliary” stack has been put in some versions of Push, for problem inputs and other non-stack-like kinds of storage (e.g. tag bindings).
  • The handling of abnormal conditions (insufficient arguments, maybe semantic errors like division by zero).
  • More radical changes to the stack-based paradigm, e.g. Kyle Harrington’s experiments with a version of Push using bi-directional tapes instead of stacks.

I had previously mulled using the name Mush for a version of Push based around Maps (as in “hash maps”) rather than stacks, with stacks being just one kind of value to which something could map. In fact that’s what’s going on under the hood in clojush anyway. But maybe with some monad concepts mixed in the M in Mush could stand for “monad” instead (or in addition).

I haven’t thought this through to its logical conclusion; I’m sharing it in case anyone else would like to…