Peli’s Vibes part 2: Autoencoders!

Okay, and we’re back discussing Peli Grietzer’s Theory of Vibe. Or, more importantly, Peli’s Theory of Vibe as I see it! Because he’s already explained it all quite a bit.

Imitation is the Highest Form of Interpretation

Up until grad school, my main method for learning math was imitation and pattern recognition (and as far as I’m concerned, I haven’t learned much math since this stopped being possible/effective). I learned calculus without any deep understanding just because I could reliably replicate the techniques. In school we are taught that we have to learn and that we should be original. We learn that “copying” and “imitating” are bad, but nobody really takes the time to explore the fact that imitation is a fundamental learning technique.

Stop copying me while I’m telling you the theory of human speech.

Peli reminds us that when artists create imitations of worldly objects, we consider these to be interpretations of those objects, or of the world itself.

I learn through my hands; I see the world through my hands.

And furthermore, if you were able to faithfully recreate the artist’s rendering of these objects, surely you have learned something of the artist’s worldview, and in turn something of the world. In what follows we will delve into the machine learning side of things, looking at how “autoencoders” learn to imitate the world they are presented with, and how the process they undergo naturally produces a set of “privileged objects” and a sort of “worldview.” This gives us more than an analogy to the relationship between a work of art, the privileged objects in the work, and an artist’s worldview. What we get for our troubles is in fact a formal mathematical basis for this relationship.

Autoencoders!!! (The Basic Mechanics)

Peli has some great explanations of how autoencoders work (with a fun analogy using Greek pottery experts!). I will not recreate those here. I will just give you the final image I have in my mind.

Okay so an autoencoder is a computer-program-thingy (sorry, y’all) which has been set loose on a very large set of inputs (of the same general type) with the goal of learning how to reproduce them (as accurately and efficiently as possible, subject to predetermined constraints). As such, you start with an “untrained autoencoder” which “trains” (learns) on a “training set” and then you end with an older and wiser “trained autoencoder” which you can run on “test data.”

The untrained autoencoder is two functions in a row (I think of a function as a box, an arrow in, and an arrow out) with a third function, uh, on the side (you’ll see). The first function accepts an input (like a digital image) and outputs some sort of code. The second function accepts that code and outputs a digital image based only on that code (remember, the ultimate goal is to reproduce the image). It is useful (I admit begrudgingly) to think of these two functions as one (really the composition of the two) which takes an input and attempts to replicate it. Do this for the entire training set, then give all the input images and attempted replicas to the third function which analyzes how wrong the replicas were and makes a tiny adjustment to the other functions, with the goal of minimizing wrongness. In other words: input images → short codes → attempted replicas → tiny adjustment. Or, inaccurately, in pictures:

This untrained autoencoder doesn’t like my joke.

The process terminates when the third function stops suggesting adjustments. (Exercise: How do I know the process will terminate?) Now you turn off the third function (the optimizer) and declare yourself to have a trained autoencoder. You should think of a trained autoencoder as having internalized a particular way of seeing things. You can learn about this way of seeing by running it on “test data” (which was not part of the training set).
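For the code-inclined, here is a minimal sketch of the loop described above (inputs → short codes → attempted replicas → tiny adjustment). It makes several assumptions that are mine and not Peli's: both functions are plain matrix multiplications (a "linear" autoencoder), "wrongness" is mean squared error, the toy data lives in 5 dimensions, and the loop simply runs for a fixed number of steps instead of waiting for the adjustments to die out. The names `train_linear_autoencoder`, `E`, and `D` are made up for illustration.

```python
import numpy as np

def train_linear_autoencoder(X, code_dim, lr=0.01, steps=2000, seed=0):
    """Toy training loop: X has shape (n_objects, input_dim)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    E = rng.normal(scale=0.1, size=(d, code_dim))  # first function: input -> code
    D = rng.normal(scale=0.1, size=(code_dim, d))  # second function: code -> replica
    losses = []
    for _ in range(steps):
        codes = X @ E                # short codes
        replicas = codes @ D         # attempted replicas
        err = replicas - X           # how wrong the replicas were
        losses.append(float(np.mean(err ** 2)))
        # Third function: gradient of the mean squared error,
        # used to make a tiny adjustment to the other two.
        grad_D = (2.0 / (n * d)) * codes.T @ err
        grad_E = (2.0 / (n * d)) * X.T @ (err @ D.T)
        E -= lr * grad_E
        D -= lr * grad_D
    return E, D, losses

# Toy training set: 200 objects in a 5-dimensional input-space
# that secretly depend on only 2 underlying aspects.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))

E, D, losses = train_linear_autoencoder(X, code_dim=2)
print("wrongness before training:", losses[0])
print("wrongness after training: ", losses[-1])
```

Real autoencoders use nonlinear functions and fancier optimizers, but the shape of the loop is the same.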

For the next while, we’re going to throw away our optimizer and think only of a trained autoencoder which has already internalized a way of summarizing and reconstructing that is the best in terms of accuracy and efficiency.

A Little Night’s Mathematics

Okay I am going to state some definitions/facts, but I’m not going to super explain them. If you aren’t a mathematician, don’t freak out, I’ll recap in non-math afterwards. (I mean, freak out if that’s your jam. You do you.)

1.) The feature function of a trained autoencoder turns a concrete object into a vector (the code).

2.) The decoder function of a trained autoencoder turns a vector into a concrete object.

3.) The projection function of a trained autoencoder turns a concrete object into a stand-in concrete object (an approximation or else, itself); it is the composition of the feature function and the decoder function.

4.) The input-space is the set of all (theoretically) possible concrete objects which could be input into the feature function. For instance, the collection of all possible images of a certain size. We view this as a solid “cube” (in more than three dimensions, though) in a vector space by arranging the set in terms of the most naive understanding of the independent characteristics. In the case of grayscale images of the same size, for example, each pixel-spot would be a dimension.

5.) The feature space is the “cube” (inside a real vector space) that the feature function maps onto. The dimension of this space is chosen in advance: the more dimensions, the more accurate the reproductions will be, but it will have fewer dimensions than the input-space. You can think of each dimension as an aspect of the objects at hand, in which case each vector is a list of numbers expressing how strongly an object exhibits each aspect. So if the allowed values for each dimension are 0 through 9, then $\langle 0, 4.5, 9 \rangle$ could mean not at all red, neutral on the brightness, and extremely goth.

We thus see that the feature function maps elements of the input-space to vectors in the (lower-dimensional) feature space, and the decoder function maps vectors from the feature space to elements in (a subset of) input-space. And the projection function maps elements of input-space back to (neither “into” nor “onto”) input-space.
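To make the three functions and their dimensions concrete, here is a small sketch in the simplest setting I know how to write down. The assumptions are mine: the input-space has 6 dimensions, the feature space has 2, and instead of training anything I take the best linear autoencoder straight from a singular value decomposition (which is a shortcut that only works in the linear case). The names `feature`, `decoder`, and `projection` follow the definitions above; everything else is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world: 100 objects in a 6-dimensional input-space that lie
# very close to a 2-dimensional plane.
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 6))
X = X + 0.01 * rng.normal(size=(100, 6))

# Shortcut for the linear case: the two most important directions
# of the data, found by SVD, play the role of the trained weights.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:2].T  # shape (6, 2)

def feature(x):
    """Concrete object (6 numbers) -> code (2 numbers)."""
    return x @ V

def decoder(code):
    """Code (2 numbers) -> concrete object (6 numbers)."""
    return code @ V.T

def projection(x):
    """Composition of the two: object -> stand-in object."""
    return decoder(feature(x))

x = X[0]
code = feature(x)
replica = projection(x)
print(code.shape, replica.shape)  # (2,) (6,)
print("distance from object to its stand-in:", np.linalg.norm(replica - x))
```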

But what happens when you map the feature space back to input-space?

6.) Peli calls the set of elements in input-space which remain unchanged under the projection function the canon of the trained autoencoder. This is more important than it sounds!!! The existence of a canon follows from the optimization process (though the canon need not be part of the training set), and in fact, the canon is actually the image of the trained autoencoder’s decoder (or projection) function. This means that the canon is all possible replicas the autoencoder could make. The canon, which is just a set of concrete objects, turns out to be logically and/or functionally equivalent to both the feature function and the projection function.

The trained autoencoder sees its canon best.

7.) The canon forms a “lower-dimensional submanifold” inside input-space. What’s more, given a point of input-space close enough to the canon, the projection function is literally the orthogonal projection of that point onto this manifold. The manifold structure thus implies a logical and inextricable relationship between a canon, a feature function, and a projection function that allows us to bring this structure to other realms.

8.) From this manifold perspective, we now have a new notion of distance or comparability.
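Points 6 and 7 can be checked numerically in a toy case. The sketch below assumes a linear autoencoder (obtained by SVD, not by training), so its canon is a flat 2-dimensional plane inside a 6-dimensional input-space, and "close enough to the canon" is not an issue. For a curved canon the same statements would only hold nearby, as described above. All names and sizes here are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear "trained autoencoder" whose canon is a 2-D plane in 6-D space.
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 6))
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:2].T  # columns span the canon

def projection(x):
    return (x @ V) @ V.T

# An object that is NOT in the canon gets moved by the projection...
outsider = rng.normal(size=6)
replica = projection(outsider)
moved = np.linalg.norm(replica - outsider)

# ...but its replica is in the canon: projecting again changes nothing.
fixed = np.allclose(projection(replica), replica)

# The leftover part is perpendicular to the canon (orthogonal projection).
residual = outsider - replica
orthogonal = np.allclose(residual @ V, 0.0)

# And no other canon object we try is closer to the outsider.
others = rng.normal(size=(1000, 2)) @ V.T
nearest = bool(np.all(np.linalg.norm(others - outsider, axis=1) >= moved - 1e-12))

print(moved > 0, fixed, orthogonal, nearest)  # True True True True
```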

Manifold interlude. I have avoided learning about manifolds thus far and I’m not going to break that now. So we’re going to stick with what little I know which is that “locally they look like $\mathbb{R}^n$” which means whatever that means. We don’t have to know the details now, but I do want to give a brief motivation for the different notions of distance.

Imagine you are walking up one side of an isosceles triangle and down the other. How far did you walk? You could measure the horizontal distance traveled (the length of the base of the triangle), or you could measure what you actually walked (the two sides of the triangle). Additionally, your perception might be that walking up was twice as hard as walking down, so maybe you want to account for that. I don’t know. Either way, these are different metrics you might use when trying to tell someone how far you went, or how far you felt like you went.

just let me catch my breath, you go on without me

As we saw above, in our situation, you can picture input-space as being three-dimensional and picture the canon-submanifold as a two-dimensional object (like a sheet of paper or a wide ribbon) that curves through space. From the external perspective we can measure distance in the usual way, but the submanifold also has its own perspective. So if you have, say, a sphere (an empty ball) in space, you can measure the distance between two points on the sphere by using the straight line that connects them by going through the sphere. This is how you measure distance in the ambient space, but of course if you must be restricted to staying on the sphere (like when we travel to other parts of the world), distance will be measured by the shortest path around the sphere.
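Here is the sphere example as a quick calculation, using only Python's standard library. The function names are mine, and the sphere is assumed to be centered at the origin.

```python
import math

def chord_distance(p, q):
    """Straight-line distance through the ambient space."""
    return math.dist(p, q)

def great_circle_distance(p, q, radius=1.0):
    """Shortest distance while staying on the sphere's surface."""
    cos_angle = sum(a * b for a, b in zip(p, q)) / radius ** 2
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return radius * math.acos(cos_angle)

north_pole = (0.0, 0.0, 1.0)
south_pole = (0.0, 0.0, -1.0)

print(chord_distance(north_pole, south_pole))         # 2.0 (straight through)
print(great_circle_distance(north_pole, south_pole))  # about 3.14159 (half the circumference)
```

Same two points, two different answers, depending on whether you are allowed to leave the surface.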

Anyway, the point is that the canon of a trained autoencoder has its own notion of distance, which we interpret as its own notion of similarity and difference. Who cares? Remember that points were close in input-space if they represented objects which were naively similar (for instance, two images which are close on a pixel-by-pixel basis), whereas being close in feature space meant being similar in more meaningful ways (for example, close in redness, brightness, and gothness). The manifold perspective allows us to see both types of similarity/difference inside of input-space. We also get that the projection function keeps track of the least input-space distance between an object not in the canon (but sufficiently close) and the canon itself. Being able to see all this just from the canon alone will become particularly important when dealing with art, since all we have access to in terms of an artist’s worldview is their renderings or interpretations of the world.

What You’re Gonna Wanna Know

Suppose you start with a large set of worldly objects and you want an algorithm to learn the essence of what these objects are about. Well, I don’t know how you’d do this or what it would mean, but whatever it means, I would agree you had succeeded if the algorithm was able to approximately replicate these objects by decoding its own interpretations. If you give these objects to an untrained autoencoder, the way it learns to approximate/replicate this world forces it to create a worldview (assessments of how similar or dissimilar the objects of that world are to each other with respect to aspects that are meaningful to it) and a canon (the privileged objects its worldview sees most accurately). Now, a trained autoencoder has already internalized a worldview and can be considered logically and functionally equivalent to its canon. Importantly, this is a consequence of the nature of replication and doesn’t rely on the training process. Furthermore, we will see that training an untrained autoencoder on a sample of the canon (the sample being considerably smaller than the original training set) is the most efficient way to pass on a worldview.
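As a sanity check on the idea of passing on a worldview through a small sample of the canon, here is a toy experiment. It only covers the linear case, where a "worldview" boils down to a flat plane and "training" is replaced by an SVD shortcut, so it illustrates the claim and does not prove Peli's general statement about efficiency. The names `fit_plane`, `teacher`, `student`, and `artwork` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_plane(data, k):
    """Best k-dimensional linear autoencoder for `data`,
    returned as its projection matrix (input-space -> canon)."""
    _, _, Vt = np.linalg.svd(data, full_matrices=False)
    V = Vt[:k].T
    return V @ V.T

# The teacher learns from 500 messy worldly objects in 6 dimensions.
world = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6))
world = world + 0.05 * rng.normal(size=(500, 6))
teacher = fit_plane(world, k=2)

# The "work of art": just five objects drawn from the teacher's canon
# (arbitrary objects pushed through the teacher's projection).
artwork = rng.normal(size=(5, 6)) @ teacher

# The student never sees the world, only the artwork.
student = fit_plane(artwork, k=2)

same_worldview = bool(np.allclose(student, teacher))
print(same_worldview)  # True
```

Five canonical objects were enough here because a plane is a very simple worldview; a curved canon would need more samples.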

A work of art (the objects/phenomena it recreates), Peli argues, can be considered to be a sample from the canon of a trained autoencoder. If the meaning of a work of art is to be found in an artist’s worldview, then it turns out that this meaning is logically equivalent to the objects the artist chose to present. Reading a work of literature, then, is equivalent to an untrained autoencoder training on samples of a trained autoencoder’s canon. And in fact this is the most efficient way to learn an artist’s worldview.

Next installment shall be more about the things!

Relevant writings of Peli Grietzer
Theory of Vibe: a recent article in Glass Bead (a good summary of everything)
Deep Learning, Literature, and Aesthetic Meaning, With Applications to Modernist Studies: Précis of May ’17 HUJI Einstein Institute of Mathematics talk (a lecture aimed at math people)
From ‘A Literary Theorist’s Guide to Autoencoding’ (an excerpt from the first chapter of his PhD thesis, introducing autoencoders)
Ambient Meaning: Mood, Vibe, System (the PhD thesis)