learning and development

Mike Cole (mcole who-is-at weber.ucsd.edu)
Wed, 18 Oct 1995 19:24:03 -0700 (PDT)

From marni who-is-at salk.edu Wed Oct 18 19:15:18 1995
CS200 Week 5 summary, by Marni Stewart Bartlett

Jeff Elman (1993). "Learning and development in neural networks: The
importance of starting small." Cognition 48: 71-99.

Elman presents an account of how a long period of development may allow
learning to be most effective. He demonstrates that there are
circumstances under which connectionist models only work when they are
forced to start small and undergo a developmental change which resembles
the increase in working memory seen in children. In part II he discusses
specific shortcomings of the learning process which are compensated for by
restricting capacity during early learning. These issues are especially
pertinent in complex problem domains such as language.

I. Simulations demonstrating the importance of starting small

Simulation 1: Incremental input

Elman reports the effects of staged input on learning grammatical
structure. The network fails to learn the task when the entire data set is
presented all at once, but succeeds when the data are presented
incrementally.

The network is a simple recurrent network, trained to take one word at a
time from a sentence and predict what the next word would be. Because the
predictions depend on grammatical structure, the prediction task forces the
network to develop internal representations that encode the relevant
grammatical information. The input consisted of a corpus of sentences such
as "boys who chase dogs see girls." Each word was encoded as a localist
vector of zeros with a single 1 at the location assigned to that word.
Elman was looking for the network to learn (a) number agreement between
subject and verb, (b) whether verbs required or permitted direct objects,
and (c) correct agreement despite multiple embeddings.

When the network was trained from the outset with a full "adult" language
containing both simple and complex sentences, it failed to master even the
training data. Elman then repeated the training, increasing the complexity
of the input in five phases, starting with all simple sentences and moving
gradually to all complex sentences. This time the network learned the
training data
and also performed well with novel sentences.
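
As a sketch of what such a regimen might look like in code (the corpus size
and the exact proportions here are illustrative assumptions, not Elman's
reported figures), each phase draws its training set from a mix of simple
and complex sentences, with the complex share growing from nothing to
everything:

    import random

    def make_corpus(phase, simple_pool, complex_pool, n=10000):
        """Illustrative five-phase regimen: the share of complex sentences grows each phase."""
        complex_fraction = [0.0, 0.25, 0.5, 0.75, 1.0][phase - 1]
        corpus = []
        for _ in range(n):
            pool = complex_pool if random.random() < complex_fraction else simple_pool
            corpus.append(random.choice(pool))
        return corpus

    # training would loop over phases 1..5, continuing each phase from the
    # weights left by the previous one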

Simulation 2: Incremental memory

Elman next shows how similar effects can be obtained under the more
realistic assumption that the input is held constant while the learning
mechanism itself undergoes developmental changes.

In the network, the analog of memory is supplied by the network's access
(via recurrent connections) to its own prior internal states. Temporal
windows for processing can be created by eliminating the feedback after
every n words. In this simulation, the network was trained from the outset
with the full adult language, but working memory was increased in five
phases, from 4-5 words in phase 1, to 6-7 words in phase 4, and unlimited
in phase 5. Again, the network performed well on both the training data and
the novel data.
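
A sketch of how this manipulation might be implemented (the window figures
for phases 1, 4, and 5 follow the summary above; the values for phases 2-3
and the reset mechanism itself are my own assumptions): the context units
are simply wiped every few words, so early in training the network cannot
"see" farther back than the current window, no matter how complex the input
sentence is.

    import numpy as np

    # (low, high) word windows per phase; None = feedback never eliminated
    phase_windows = {1: (4, 5), 2: (5, 6), 3: (5, 6), 4: (6, 7), 5: None}

    def run_phase(phase, sentences, step, hidden_size, rng):
        """Process one phase, wiping the recurrent context every `window` words."""
        lo_hi = phase_windows[phase]
        context = np.zeros(hidden_size)
        limit = None if lo_hi is None else rng.integers(lo_hi[0], lo_hi[1] + 1)
        since_reset = 0
        for sentence in sentences:
            for word in sentence:
                probs, context = step(word, context)  # forward pass (weight updates omitted)
                since_reset += 1
                if limit is not None and since_reset >= limit:
                    context = np.zeros(hidden_size)   # feedback eliminated: memory erased
                    since_reset = 0
                    limit = rng.integers(lo_hi[0], lo_hi[1] + 1)

Here `step` is the forward pass from the earlier sketch; only the schedule
of context resets changes from phase to phase, not the input.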

This is what's going on:

In a sentence like "The girl who the dogs that I chased down the block
frightened, ran away," the evidence that the verb "frightened" is transitive
is obscure. If the network is presented with the full adult corpus from the
outset, it comes up with a solution that works most of the time, but one
that is not tied to the true sources of variance, which are the grammatical
factors. With a limited temporal window, the effective data are just simple
sentences plus noise. By selectively focusing on simple sentences, the
network appears to learn the basic distinctions, such as grammatical
category, number, and transitivity, which form the basis for later learning
of more complex interactions involving long-distance dependencies. As
learning advances, all further changes are constrained by this early
commitment to the basic grammatical factors.

II. How Networks Learn

The starting-small effect arises as a consequence of fundamental properties
of learning in connectionist models. Depending on the problem, these same
properties can have the opposite effect: in XOR, for example, starting small
can result in the network learning the wrong generalization.

The basic argument goes like this: The inputs during the early stages of
learning are very important because at that time the network is in its most
sensitive and malleable state. The sample of data seen so far is
necessarily small, so the network can easily fall into incorrect solutions.
If it gets off on the wrong track, it may not be possible to reach the
correct hypothesis later in learning.

Property 1: The problem of sample size. If the sample size is too small,
there can be many solutions that are compatible with the data set but are
not general solutions to the problem.
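
A toy illustration of this point, using the XOR case mentioned above (the
example is mine, not Elman's): if only three of the four XOR patterns have
been seen, the data do not decide between XOR and inclusive OR, and a
learner is free to settle on either.

    from itertools import product

    inputs = list(product([0, 1], repeat=2))      # (0,0), (0,1), (1,0), (1,1)
    seen = {(0, 0): 0, (0, 1): 1, (1, 0): 1}      # a small sample: three of the four patterns

    # enumerate all 16 Boolean functions of two inputs; keep those consistent with the sample
    consistent = []
    for outputs in product([0, 1], repeat=4):
        f = dict(zip(inputs, outputs))
        if all(f[x] == y for x, y in seen.items()):
            consistent.append(f)

    for f in consistent:
        print(f)   # two survivors: XOR (maps (1,1) to 0) and inclusive OR (maps (1,1) to 1)

In weight space the situation is worse: infinitely many weight settings fit
the small sample, many of which implement the wrong generalization.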

Property 2: No explicit access to previous data. The network has access to
previous data only implicitly in the form of the current weight settings.
Old data therefore can't be used to generate a new hypothesis if the
present one is bad.

Property 3: Constraints on new hypotheses. With gradient descent learning,
transitions to new hypotheses are small and continuous. The network can
only find the correct solution if there is a downhill path to that solution
on the error surface from where it is now. It can't simply jump to
radically different hypotheses.
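
A one-dimensional caricature of this (the error surface below is invented
purely for illustration): gradient descent can only slide downhill from
wherever it currently sits, so from a bad starting point it settles into
the nearer, shallower minimum and never reaches the better solution on the
other side of the hill.

    def error(w):                 # toy error surface: deep minimum near w = -1, shallow one near w = +1
        return (w**2 - 1)**2 + 0.3 * w

    def grad(w):                  # derivative of the error surface
        return 4 * w * (w**2 - 1) + 0.3

    for start in (-2.0, 2.0):
        w = start
        for _ in range(2000):
            w -= 0.01 * grad(w)   # small, continuous steps only; no jumps
        print(start, "->", round(w, 3), "error", round(error(w), 3))
    # starting at -2.0 reaches the deep minimum near -1;
    # starting at +2.0 gets stuck in the shallow minimum near +1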

Property 4: The ability to learn changes over time. The sigmoidal transfer
function causes the network, as learning proceeds, to lose both its
sensitivity to differences in the input and its plasticity.
a. Sensitivity loss. In the early stages of learning, the net
input to units tends to be near zero because of the random weights. In
later stages, the response saturates to the top or bottom of the sigmoid
and the network will be blind to small changes in the input that it hasn't
already learned to discriminate.
b. Plasticity loss. When units are in the flat regions of the
sigmoid, the weights can't change. The partial derivative of the error
function w.r.t. the weights produces a term that depends on the derivative
of the transfer function. If that derivative is near zero, then the weight
change is near zero.
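
A quick numerical illustration of (b), using the standard backprop algebra
rather than anything specific to Elman's simulations: for a sigmoid f, the
weight change carries the factor f'(net) = f(net) * (1 - f(net)), which is
largest when the net input is near zero and collapses toward zero once the
unit saturates.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_prime(x):
        s = sigmoid(x)
        return s * (1.0 - s)      # the factor that enters every backprop weight update

    lr, error_signal, input_activation = 0.1, 0.5, 1.0
    for net in (0.0, 2.0, 8.0):
        delta_w = lr * error_signal * sigmoid_prime(net) * input_activation
        print(f"net={net:4.1f}  f'(net)={sigmoid_prime(net):.4f}  weight change={delta_w:.5f}")
    # near net = 0 the unit is maximally plastic; by net = 8 the update is essentially zero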

* * *

Early learning constrains the solution space to a much smaller region.
This is important in complex domains such as language in which there are
many false solutions. Initial memory limitations act as a filter on the
input and focus learning on a subset of facts which constrain later
solutions. The noisiness of the immature nervous system can also retard
learning until the sample size is much larger and more accurate
generalizations can be made.

In the introduction, Elman proposes that developmental changes may be a
factor in accounting for the "projection problem" in language
acquisition. The "projection problem" is that the data available to the
language learner may be insufficient to uniquely determine the grammar,
since there are no negative examples in spoken language. He doesn't
explain this explicitly, but we can infer the following: the maturational
process can constrain the search for the correct grammar by reducing the
hypothesis space, thereby making it possible to uniquely determine the
grammar from positive examples alone.