

Aggregated experiments and results

Table of Contents

  1. Autoencoding
    1. Vanilla Autoencoding
    2. VQ-AE-GAN
  2. Geometric Scene Similarity
    1. Simple Shape Counting
    2. Pushing the Limits of Language
    3. Progressive Language Expansion
    4. Alec Mode
    5. Variable Length Sequences
    6. Out of Distribution Prediction
    7. Complicating the Generation Unit
    8. Complicating the Visual Unit
    9. Softmax-Argmax Quantizer
    10. Random Sampler
    11. Gumbel-Softmax Sampler
    12. Sparsity Restraint on Visual Unit


Vanilla Autoencoding


Geometric Scene Similarity

Simple Shape Counting

tl;dr: Perhaps somewhat unsurprisingly, the model learns to count via token-number association if we restrict its vocabulary to \(N\) tokens and the number of objects on a screen \(\in [1, N]\).

All experiments use a language defined by {seq_len=1, vocab_size=10}; a dataset defined by 64x64x3 images, outline and rotation enabled, and {min_shapes=1, max_shapes=10}; and a model trained for 1000 batches of 256 samples with a DLSM architecture (see specifics in the linked full results).
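The shared setup above can be summarized as a configuration sketch (field names here are illustrative, not necessarily those used in the actual codebase):

```python
# Illustrative configuration for the shape-counting experiments.
# Field names are hypothetical; the real codebase may organize this differently.
config = {
    "language": {"seq_len": 1, "vocab_size": 10},
    "dataset": {
        "image_size": (64, 64, 3),
        "outline": True,
        "rotation": True,
        "min_shapes": 1,
        "max_shapes": 10,
    },
    "training": {"num_batches": 1000, "batch_size": 256},
}

# Total samples seen over training.
total_samples = config["training"]["num_batches"] * config["training"]["batch_size"]
# 1000 * 256 = 256000
```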

Variation 1: Single Shape Counting. There is only one shape (square) and one color (red). Reaches a binary cross-entropy (BCE) loss of 0.106625. Each of the 10 tokens becomes reliably associated with a certain number of objects from 1 to 10.

Full results here.

Variation 2: Varied Shape Counting. There are three shapes (circle, square, triangle) and one color (red). Reaches a BCE loss of 0.267393. Each of the 10 tokens becomes reliably associated with a certain number of objects from 1 to 10. As expected, there is some error for larger numbers of objects due to overlap.

Full results here.

Variation 3: General Object Counting. There are three shapes (circle, square, triangle) and three colors (red, green, blue). Reaches a BCE loss of 0.340383. Each of the 10 tokens becomes somewhat reliably associated with a certain number of objects from 1 to 10. There is more error in exact counting for large shape counts, as in Variation 2, but some of the counting is imperfect for smaller object counts, too.

Full results here.

Pushing the Limits of Language

What is the relationship between a combination of permitted {vocabulary size, sequence length} and the performance?

Preliminary findings:

  • 4 tokens seems to be the minimum vocabulary size for decent performance (without varying length).
  • Increasing sequence length can actually have deleterious effects.
  • The model generally performs well when the vocabulary size is large and the sequence length is small.
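One plausible framing of these findings (an assumption, not something the results state directly): a language with vocabulary size \(v\) and fixed sequence length \(l\) can express at most \(v^l\) distinct messages, so the message space must be at least as large as the number of scene configurations to distinguish. A minimal sketch:

```python
def message_capacity(vocab_size: int, seq_len: int) -> int:
    """Number of distinct fixed-length messages the language can express."""
    return vocab_size ** seq_len

# With 10 object counts to distinguish, the message space must hold >= 10 messages.
assert message_capacity(10, 1) >= 10   # the setup used in the counting experiments
assert message_capacity(4, 2) >= 10    # 4 tokens suffice once seq_len > 1
assert message_capacity(3, 1) < 10     # a 3-token, length-1 language cannot count to 10
```

This capacity bound explains why a very small vocabulary fails, but not why longer sequences can hurt; the latter is presumably an optimization effect rather than an expressiveness limit.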

Progressive Language Expansion

Idea: begin with a very simple setup (e.g. just blue squares), then slowly introduce new attributes (e.g. blue squares, triangles, circles; then all combinations of {blue, red, green} and {squares, triangles, circles}) and observe whether the language is retained and how it adapts to new environmental stimuli.
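The expansion schedule described above could be expressed as a staged curriculum, sketched here with illustrative attribute names:

```python
# Sketch of a progressive-expansion curriculum (stage contents are illustrative).
curriculum = [
    {"shapes": ["square"], "colors": ["blue"]},                        # stage 0: blue squares only
    {"shapes": ["square", "triangle", "circle"], "colors": ["blue"]},  # stage 1: add shapes
    {"shapes": ["square", "triangle", "circle"],
     "colors": ["blue", "red", "green"]},                              # stage 2: add colors
]

def attribute_combinations(stage):
    """Number of distinct shape/color combinations available at a stage."""
    return len(stage["shapes"]) * len(stage["colors"])

# The environment grows from 1 to 9 combinations across stages.
sizes = [attribute_combinations(s) for s in curriculum]  # [1, 3, 9]
```

Tracking which tokens survive each stage transition would then measure language retention.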

Alec Mode

Vanilla Alec Mode

Strong Alec Mode

Variable Length Sequences

Out of Distribution Prediction

  • The network seems to be capable of generating novel tokens.

Complicating the Generation Unit

  • Using LSTMs vastly outperforms GRUs. To use LSTMs, set the initial cell state to the generated image latent vector \(z\) (the output of the visual unit) and the initial hidden state to a random vector (‘sampling’ speech).
  • Using Bidirectional LSTMs yields slightly better performance. Still need to test impact on quality of generated words.
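The initialization scheme above can be sketched with a single hand-rolled LSTM step (a real implementation would use a deep-learning framework's LSTM; weights and dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 16

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM step; W maps [x; h] to the four stacked gate pre-activations."""
    gates = W @ np.concatenate([x, h])
    i, f, g, o = np.split(gates, 4)          # input, forget, candidate, output
    c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_next = sigmoid(o) * np.tanh(c_next)
    return h_next, c_next

z = rng.normal(size=hidden)                  # image latent from the visual unit
c0 = z                                       # cell state seeded with the latent
h0 = rng.normal(size=hidden)                 # random hidden state: 'sampling' speech
W = rng.normal(size=(4 * hidden, 2 * hidden)) * 0.1

x = np.zeros(hidden)                         # e.g. a start-of-sequence embedding
h1, c1 = lstm_step(x, h0, c0, W)
```

Seeding the cell state with \(z\) conditions every generated token on the image, while the random hidden state injects the stochasticity used for sampling distinct utterances.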

Complicating the Visual Unit

  • Increasing the number of convolutional layers helps improve performance.
  • Capsule networks are too complicated and not worth it: they take over the task that the language is supposed to perform. Possibly a good benchmark, though.

Softmax-Argmax Quantizer

  • Using the Softmax-Argmax sampler with a double-LSTM speaker and listener performs as well as using the VQ-VAE-style quantizer, and produces a slightly larger number of unique generated sequences. Still need to test language quality.
  • The softmax-argmax quantizer reaches near-perfect performance on a 3-shape, 3-color, 3-object Alec mode task.
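A minimal forward-pass sketch of a softmax-argmax quantizer (shapes are illustrative; the gradient trick used during training is noted in the docstring):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_argmax_quantize(logits):
    """Soft distribution forward; discrete one-hot token via argmax.

    In an autograd framework the one-hot output would typically be written as
    probs + stop_gradient(one_hot - probs), so gradients flow through the
    softmax (a straight-through estimator).
    """
    probs = softmax(logits)
    tokens = probs.argmax(axis=-1)
    one_hot = np.eye(probs.shape[-1])[tokens]
    return tokens, one_hot, probs

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
tokens, one_hot, probs = softmax_argmax_quantize(logits)
# tokens -> [0, 2]
```

Unlike the VQ-VAE-style quantizer, this requires no learned codebook: the vocabulary is just the softmax dimension.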

Random Sampler

Gumbel-Softmax Sampler


Sparsity Restraint on Visual Unit

The visual unit's output must be somewhat sparse.
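One common way to impose such a restraint (an assumption about the method, not something these notes confirm) is an L1 penalty on the visual unit's output added to the reconstruction loss:

```python
import numpy as np

def sparsity_penalty(z, weight=0.01):
    """L1 penalty pushing most latent entries toward zero (weight is illustrative)."""
    return weight * np.abs(z).mean()

def total_loss(recon_loss, z, weight=0.01):
    return recon_loss + sparsity_penalty(z, weight)

z_dense = np.ones(32)
z_sparse = np.zeros(32)
assert total_loss(0.5, z_sparse) < total_loss(0.5, z_dense)
```

The weight trades off reconstruction fidelity against how many latent dimensions stay active.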

I2 - Fusing neuroscience and AI to study intelligent computational systems.