Abstract

Latent Semantic Analysis (LSA) is a statistical, corpus-based text comparison mechanism that was originally developed for the task of information retrieval, but in recent years has produced remarkably human-like abilities in a variety of language tasks. LSA has taken the Test of English as a Foreign Language and performed as well as non-native English speakers who were successful college applicants. It has shown an ability to learn words at a rate similar to humans. It has even graded papers as reliably as human graders. We have used LSA as a mechanism for evaluating the quality of student responses in an intelligent tutoring system, and its performance equals that of human raters with intermediate domain knowledge. It has been claimed that LSA’s text-comparison abilities stem primarily from its use of a statistical technique called singular value decomposition (SVD), which compresses a large amount of term and document co-occurrence information into a smaller space. This compression is said to capture the semantic information that is latent in the corpus itself. We test this claim by comparing LSA to a version of LSA without SVD, as well as to a simple keyword matching mechanism.

1 Introduction

In the late 1980’s, a group at Bellcore doing research on information retrieval techniques developed a statistical, corpus-based method for retrieving texts. Unlike the simple techniques which rely on weighted matches of keywords in the texts and queries, their method, called Latent Semantic Analysis (LSA), created a high-dimensional, spatial representation of a corpus and allowed texts to be compared geometrically. In the last few years, several researchers have applied this technique to a variety of tasks including the synonym section of the Test of English as a Foreign Language [Landauer et al., 1997], general lexical acquisition from text [Landauer and Dumais, 1997], selecting texts for students to read [Wolfe et al., 1998], judging the coherence of student essays [Foltz et al., 1998], and the evaluation of student contributions in an intelligent tutoring environment [Wiemer-Hastings et al., 1998; 1999]. In all of these tasks, the reliability of LSA’s judgments is remarkably similar to that of humans.
Although classical Natural Language Processing techniques have begun to produce acceptable performance on real world texts, as shown in the Message Understanding Conferences [DARPA, 1995], they still require huge amounts of painstaking knowledge engineering and are fairly brittle in the face of unexpected input. Recently, corpus-based statistical techniques have been developed in the areas of word-tagging and syntactic grammar inference. But these techniques are not aimed at providing a representation of the meaning of a text.

The specific source of LSA’s discriminative power is not exactly clear. A significant part of its processing is a type of principal components analysis called singular value decomposition (SVD), which compresses a large amount of co-occurrence information into a much smaller space. This compression step is somewhat similar to the common feature of neural network systems where a large number of inputs is connected to a fairly small number of hidden layer nodes. If there are too many nodes, a network will “memorize” the training set, miss the generalities in the data, and consequently perform poorly on a test set. The basic input to LSA is a large amount of text (on the order of magnitude of a book). The corpus is turned into a co-occurrence matrix of terms by “documents”, where for our purposes, a document is a paragraph. SVD computes an approximation of this data structure of an arbitrary rank K. Common values of K are between 200 and 500, and are thus considerably smaller than the usual number of terms or documents in a corpus, which are on the order of 10000. It has been claimed that this compression step captures regularities in the patterns of co-occurrence across terms and across documents, and furthermore, that these regularities are related to the semantic structure of the terms and documents.

*This work was supported by grant number SBR 9720314 from the National Science Foundation’s Learning and Intelligent Systems program.
In this paper, we examine this claim by comparing several approaches which assess the quality of student contributions in an intelligent tutoring situation. We use human judgments of quality as a baseline, and compare them to three different models: the full LSA model, a version of LSA without SVD, and a simple keyword-matching mechanism. The paper starts with a description of the quality judgment task, and describes how LSA was used to rate the contributions. In section 3, we describe the implementation of LSA without SVD, and compare it to the SVD results. In section 4, we compare these to a basic keyword matching algorithm which used both a weighted and an unweighted matching technique. We close with a discussion of these results.
2 Evaluating student contribution quality with LSA

To provide a baseline description against which the alternative methods can be judged, this section describes the rating task for both the humans and LSA, gives some technical details of the LSA implementation, and describes how it performed in relation to the human raters.

As the litmus test for the various evaluation techniques, we have chosen the domain of an intelligent tutoring system called AutoTutor that was developed with the goal of simulating natural human-human dialogue [Wiemer-Hastings et al., 1998]. The central data structure for AutoTutor was a curriculum script [Putnam, 1987] that contained 12 questions in each of three different topics: computer hardware, operating systems, and the internet. For each question in the curriculum script, there was a variety of information about expected student answers and possible follow-up dialogue moves. The questions were designed to be deep reasoning questions for which a complete answer would cover several aspects. AutoTutor’s curriculum script contained an expected good answer for each of the aspects of a question, as well as a prompt, hint, and elaboration that could potentially elicit that answer. The use of these dialogue moves was based on studies of human tutors [Graesser et al., 1995]. AutoTutor chose which move to use based on the student’s ability and on which expected good answers were already covered. LSA was the primary mechanism for determining that coverage, based on comparisons between the student responses and the expected good answers. When a particular contribution achieved a cosine match above an empirically determined threshold, that aspect of the question was considered as covered for the purposes of the tutoring task. This approach led to the definition of the basic evaluation measure:

    Compatibility = Matches / Propositions

where Propositions is the number of speech acts in the student contribution, and Matches is the number of Propositions that achieved an above-threshold LSA cosine with one of the expected good answers. Loosely speaking, this is the percentage of the student’s contribution that sufficiently matched the expected answers.
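To make the measure concrete, the following minimal Python sketch computes Compatibility for one student response. The similarity function and the threshold value stand in for the LSA comparison and the empirically determined cutoff described here; they are assumptions for illustration, not the actual AutoTutor code.

    def compatibility(propositions, good_answers, similarity, threshold):
        # Fraction of the student's speech acts (propositions) whose best
        # match against any expected good answer reaches the threshold.
        if not propositions:
            return 0.0
        matches = sum(
            1 for p in propositions
            if any(similarity(p, a) >= threshold for a in good_answers)
        )
        return matches / len(propositions)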
The test set for this task was based on eight questions from each of the three tutoring topics. Students in several sections of a university-level computer literacy course were given extra credit for typing in answers to the questions in a word processing document. They were encouraged to write complete, thorough answers to the questions. Eight substantive (i.e. not “I don’t know”) answers were randomly selected for each of the 24 questions, constituting a test set of 192 items.

To assess the depth of knowledge that LSA uses, human raters with different levels of experience with the subject matter were used. Two raters, a graduate student and a research fellow, were computer scientists with high levels of knowledge of the computer literacy domain. Two additional raters, a graduate student and a professor in Psychology, had intermediate-level knowledge. They were familiar with all of the text materials from the computer literacy domain that were used in the project.

The human raters were asked to break the student responses into propositions, i.e. parts that could stand on their own, and then to judge on a six-point scale the percentage of each student’s propositions that “matched” part of the ideal answer. They were not instructed as to what should constitute a match. The correlation between the two expert raters was r=0.78. Between the intermediate knowledge raters, the correlation was r=0.52. The correlation between the average expert rating and the average intermediate rating was r=0.76. All of the correlations were significant.

We briefly describe the LSA mechanism here in order to demonstrate the difference between it and the other approaches. Further technical details about LSA can be found in [Deerwester et al., 1990; Landauer and Dumais, 1997] and several of the articles in the 1998 special issue of Discourse Processes on quantitative approaches to semantic knowledge representation.
As mentioned above, the basic input to LSA is a large corpus of text. The computer literacy corpus consisted of two complete computer literacy textbooks, ten articles on each of the tutoring topics, and the entire curriculum script (including the expected good answers). Each curriculum script item counted as a separate document, and the rest of the corpus was separated into paragraphs, because paragraphs tend to describe a single complex concept. The entire corpus was approximately 2.3 MB of text. LSA defines a term as a word (separated by whitespace or punctuation) that occurs in at least two documents. There is also a list of about 400 very frequent words (“the”, “and”, and “for”, for example) that are not used as terms. As previously mentioned, LSA creates from this corpus a large co-occurrence matrix of documents by terms, in which each cell is the number of times that that term occurred in that document. Each cell is then multiplied by a log entropy weight which essentially reduces the effect of words which occur across a wide variety of contexts (more about this later). SVD then creates a K-dimensional approximation of this matrix consisting of three matrices: a D by K documents matrix, a K by T terms matrix, and a K by K singular values (or eigenvalues) matrix (D is the number of documents, and T is the number of terms). Multiplying these matrices together results in an approximation to the original matrix. Each column of the terms matrix can be viewed as a K-long vector representing the “meaning” of that term. Each row of the documents matrix can be seen as a K-long vector representing the meaning of that document. Furthermore, each document vector equals the sum of the vectors of the terms in that document.
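The pipeline just described can be sketched in a few lines of numpy. The paper does not spell out its exact log entropy formula, so the standard log-entropy weighting used below is an assumption, as are the function names.

    import numpy as np

    def log_entropy_weight(counts):
        # counts: D x T matrix of raw term frequencies per document.
        # Local log weight times a global entropy weight that shrinks
        # terms spread evenly across many documents (assumed standard
        # log-entropy form; the paper does not give its exact formula).
        n_docs = counts.shape[0]
        p = counts / np.maximum(counts.sum(axis=0), 1.0)
        with np.errstate(divide="ignore", invalid="ignore"):
            plogp = np.where(p > 0, p * np.log(p), 0.0)
        global_w = 1.0 + plogp.sum(axis=0) / np.log(n_docs)
        return np.log(counts + 1.0) * global_w

    def lsa_space(counts, k=200):
        # Rank-k SVD of the weighted matrix. Rows of `docs` are the
        # K-long document vectors; columns of `terms` are the K-long
        # term vectors described in the text.
        x = log_entropy_weight(counts)
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        docs, terms = u[:, :k] * s[:k], vt[:k, :]
        return docs, terms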
[Figure 1: The correlation between LSA quality judgments and those of human raters.]

The LSA mechanism for AutoTutor works by calculating the vectors for the student contributions and comparing them to the document vectors for the expected good answers using the cosine metric. Empirical analyses of the corpus size, the number of dimensions, and the thresholds showed that the LSA mechanism performed best with the entire corpus described above, and with 200 dimensions in the LSA space. Figure 1 shows the correlations between the LSA ratings and the average of the human ratings over a variety of cosine match thresholds. The correlation between LSA and the humans approaches that between the human raters. Although a slightly higher correlation was achieved with a 400-dimension LSA space, this increased performance was limited to a single threshold level. This was interpreted as a potential outlier, and the 200 dimension space, with its relatively flat performance across thresholds, was preferred.
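This comparison step can be sketched as follows: a new text’s vector is the sum of the vectors of its known terms (as noted above, a document vector equals the sum of its term vectors), and two texts are compared by the cosine between their vectors. The whitespace tokenization and the vocabulary dictionary here are simplifying assumptions.

    import numpy as np

    def text_vector(text, term_index, terms):
        # Sum the K-long vectors of the known terms in the text;
        # `terms` is the K x T matrix from the sketch above.
        cols = [term_index[w] for w in text.lower().split() if w in term_index]
        if not cols:
            return np.zeros(terms.shape[0])
        return terms[:, cols].sum(axis=1)

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom else 0.0

    # A student contribution covers an aspect of the question when the
    # cosine between its vector and the vector of the corresponding
    # expected good answer exceeds the empirically determined threshold.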
3 LSA without SVD

As previously mentioned, LSA has several attributes that may be responsible for its ability to make effective similarity judgments on texts. In addition to the compression/generalization provided by the SVD calculation, LSA might get its benefits from its initial representation of word “meaning” as a vector of the documents that it occurs in. Before the SVD processing, this representation is modified by an information theoretic weighting of the elements, which gives higher weights to terms that appear distinctively in a smaller number of texts, and lower weights to terms that occur frequently across texts. The comparison of texts using the cosine measure on such vectors might also be responsible for such good performance. To test how much discriminative power LSA gains from SVD, we implemented a version of LSA without SVD. This section describes the implementation and evaluation of this mechanism, and relates it to the evaluation of the standard LSA approach.

To create this model, we started with the documents by terms co-occurrence matrix after the information theoretic weighting and before the SVD processing. We took the columns of this matrix as a representation of the meaning of each term. Because there were over 8000 documents in the corpus and most terms occur in a small number of documents, this is a very sparse representation. Still, it is possible to compare these vectors using the cosine metric. Two terms which occur in exactly the same set of documents would have a cosine of 1.0. Terms which occur in disjoint sets of documents have a cosine of 0. It is also possible with this representation to compute a document vector by adding the vectors of the terms in the document. However, it is not possible to construe the rows in the co-occurrence matrix as the vectors representing document meaning, because they have a different rank (the number of terms in the corpus) and because there is no reason to equate a pattern of term occurrence (the terms are alphabetized in the representation) with a pattern of document occurrence. Thus, we had to calculate vectors not just for the student contributions but for the expected good answer documents as well.
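In this variant, a term’s vector is simply its weighted column of the documents-by-terms matrix, so text vectors are D-dimensional rather than K-dimensional. A minimal sketch under the same assumptions as before:

    import numpy as np

    def text_vector_no_svd(text, term_index, weighted_counts):
        # Without SVD, a term's "meaning" is its weighted column of the
        # D x T matrix, so a text's vector (the sum of its term vectors)
        # is D-dimensional and very sparse. As noted above, the expected
        # good answers are folded in this same way rather than read off
        # the matrix rows.
        cols = [term_index[w] for w in text.lower().split() if w in term_index]
        if not cols:
            return np.zeros(weighted_counts.shape[0])
        return weighted_counts[:, cols].sum(axis=1)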
[Figure 2: The correlation between LSA without SVD and human raters.]

After these vectors were computed, the evaluation was done in exactly the same way as the evaluation of the standard LSA mechanism. Figure 2 shows the correlations between the average of the humans’ ratings and the non-SVD model. It is clear that the combination of the distributed, weighted vectors and the geometrical comparisons was sufficient to produce judgments approaching those of the full LSA model. The maximum performance here is r = 0.43. As a reminder, the maximum performance of the full LSA model was r = 0.48. The maximum performance in this case, however, occurs at just one threshold. For the 200-dimension LSA model, there was fairly stable performance across several thresholds.
4 Keyword matching

Because the performance of the non-SVD algorithm was so close to that of the full LSA implementation, we decided to evaluate a simple keyword-based approach for this task. This section describes the implementation and evaluation of this approach.

To compare texts with a keyword-matching approach, we used the same segmentation of the student contribution, the same set of expected good answers for each question, and the same set of terms (as keywords) as in the other approaches. We used the same Compatibility measure (Matches / Propositions) that we used for LSA. To determine the extent to which a student contribution speech act S matched an expected good answer E, we defined the keyword match, KM, as follows:

    KM(S, E) = ( Σ_{t ∈ S ∩ E} w_t ) / max(|S|, |E|)

The variable w_t is the weight for a particular term. We tested this keyword approach using both a 1.0 weight for all terms, and also using the information theoretic weights calculated by LSA. The keyword match is essentially the sum of the weights for each keyword that occurs in both the student contribution and the expected good answer, divided by the maximum number of keywords in these two texts. As in the other evaluations, we correlated the performance of the metric at a range of different threshold levels, as described in the next section.
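A direct transcription of KM into Python might look like the following; representing the texts as keyword sets and defaulting missing weights to 1.0 (which also yields the unweighted variant) are our assumptions.

    def keyword_match(student_terms, answer_terms, weights=None):
        # KM: sum of the weights of the keywords shared by the two texts,
        # divided by the larger keyword count of the two texts. Passing
        # the LSA log-entropy weights gives the weighted variant.
        s, e = set(student_terms), set(answer_terms)
        if not s or not e:
            return 0.0
        shared = sum((weights or {}).get(t, 1.0) for t in s & e)
        return shared / max(len(s), len(e))

With KM substituted for the LSA cosine, the same Compatibility and thresholding procedure from section 2 applies unchanged.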
In our first evaluation of the keyword model, we used the same set of thresholds as in the non-SVD evaluation, namely from 0.95 down to 0.05 in 0.05 increments. This resulted in somewhat of a floor effect in the testing, however. The LSA weights for terms varied from about 0.3 to 1, but the highest values were only for very rare terms. Thus, most KM values for the weighted approach were relatively low, reaching a maximum of around 0.35, so we also ran the analysis on a set of thresholds from 0.38 down to 0.02 in 0.02 increments.

Figure 3 shows the correlations with the human ratings for the unweighted keyword model, and both threshold sets for the weighted model. The threshold labels in the figure do not correspond to the actual thresholds for the 0.38 to 0.02 threshold set. The actual thresholds, however, are not important. The general shape of the curve is a fairly clear indicator of the behavior of the model.

[Figure 3: Performance of the keyword matching technique.]

The most striking feature of this experiment is the peak correlation of r=0.47 shown by the weighted model at the 0.08 threshold level. This is almost equivalent to the maximum performance of the full LSA model. Similar to the 400-dimension LSA model and the no-SVD model described earlier, however, this point appears to be an outlier that would be unlikely to apply across another test set, because it is significantly higher than the neighboring thresholds, which display a fairly flat curve.

We are comfortable in claiming that the simple keyword model can achieve a reliable correlation of r = 0.40 with the human raters, with the weighted model showing a relatively flat contour across a range of thresholds. This level of performance is quite close to that shown by the LSA without SVD model, and within about 20% of the performance of the full LSA model. Given the very small amount of computational resources required to calculate the keyword approach (the terms and their weights are simply accessed in a hash table), such an approach to text comparison could be beneficial when computational resources are more important than getting the most reliable judgments.
Although the computation of the keyword match was fairly simple, it must be noted that the information theoretic weights used in the weighted keyword model came from the two-textbook corpus that was used for LSA. Collecting this amount of text was a daunting task, but alternative term weights could be calculated from a smaller corpus or from an online lexicographic tool like WordNet [Fellbaum, 1998].

5 Discussion

LSA gets its power from a variety of sources: the corpus-based representation of words, the information theoretic weighting, the use of the cosine to calculate distances between texts, and also SVD. SVD should make LSA more robustly able to derive text meaning when synonyms or other similar words are used. This may be reflected by the wider range of thresholds over which LSA performance remains relatively high.

Even though LSA without SVD seems to perform fairly well,¹ it must be noted that the use of SVD results in a very large space and processing time advantage by drastically reducing the size of the representation space. If we took LSA without SVD as the original basis for comparison, and then discovered the advantages of SVD with its ability to “do more with less”, it would clearly be judged superior to the non-SVD LSA model.

¹Similar results of a relatively small effect of SVD on a different corpus were reported by Guy Denhière.

It should also be noted that this task is rather difficult for LSA. It has been previously shown that LSA does better when it has more text to work with [Rehder et al., 1998], with relatively low discriminative abilities in the 2–60 word range, and steadily climbing performance for more than 60 words. In fact, other researchers have reported that in short-answer situations, LSA acts rather like a keyword matching mechanism; it is only with longer texts that LSA really distinguishes itself (Walter Kintsch, personal communication, January, 1999). Because the student texts in this study are relatively short (average length = 18 words), LSA had less information on which to base its judgments, and therefore its abilities to discriminate were reduced. It is possible that with longer texts there would be more of a difference between the performance of LSA and the alternative methods presented here. On the other hand, we must also point out that this lack of text seems to have hurt the human raters’ abilities to discriminate as well, resulting in fairly low inter-rater reliability scores.
The results presented here do not mitigate the promise of such corpus-based, statistical mechanisms as LSA. They suggest, however, that more research is needed to further tease apart the strengths of the various aspects of such an approach. In future research, we will remove the information theoretic weighting from the non-SVD model to determine how well the system can perform with the raw co-occurrence data alone.

6 Conclusions

In this paper we addressed the question of the contribution of the compression step of SVD to LSA, and we compared LSA to a simple keyword-based mechanism in evaluating the quality of student responses in a tutoring task. We showed that although the performance of the full LSA model was superior to the reduced models, these alternatives approached the discriminative power of the full model.
In conclusion, if you want a text evaluation mechanism based on comparisons, and if you have a good set of texts as a basis of comparison, you have several options. A simple keyword match performs surprisingly well, and is relatively inexpensive computationally. A mechanism like the no-SVD model presented here does not produce better maximum performance than the keyword model on these relatively short texts, but it does produce good performance across a range of thresholds, indicating a robustness in handling a variety of inputs. The full LSA model exceeds both the performance and the robustness of both of these models, achieving results comparable to those of humans with intermediate domain knowledge. Because the initial goal of the AutoTutor project is to simulate a normal human tutor who has no special training but nevertheless produces significant learning gains, we are happy with this level of performance. In future research, we will address the possibility of combining structural analysis of the student texts with LSA’s semantic capabilities. This may hold the key to approaching the performance of expert human raters in this task.
Acknowledgments

This work was completed with the help of Katja Wiemer-Hastings, Art Graesser, Roger Kreuz, Lee McCaulley, Bianca Klettke, Tim Brogdon, Melissa Ring, Ashraf Anwar, Myles Bogner, Fergus Nolan, and the other members of the Tutoring Research Group at the University of Memphis: Patrick Chipman, Scotty Craig, Rachel DiPaolo, Stan Franklin, Max Garzon, Barry Gholson, Doug Hacker, Xiangen Hu, Derek Harter, Jim Hoeffner, Jeff Janover, Kristen Link, Johanna Marineau, Bill Marks, Michael Muellenmeister, Brent Olde, Natalie Person, Victoria Pomeroy, Holly Yetman, and Zhaohua Zhang. We also wish to acknowledge very helpful comments on a previous draft by three anonymous reviewers.

References

[DARPA, 1995] DARPA. Proceedings of the Sixth Message Understanding Conference (MUC-6). Morgan Kaufman Publishers, San Francisco, 1995.

[Deerwester et al., 1990] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.

[Fellbaum, 1998] C. Fellbaum. WordNet: An electronic lexical database. MIT Press, Cambridge, MA, 1998.

[Foltz et al., 1998] P. W. Foltz, W. Kintsch, and T. K. Landauer. The measurement of textual coherence with latent semantic analysis. Discourse Processes, 25:285–307, 1998.

[Graesser et al., 1995] A. C. Graesser, N. K. Person, and J. P. Magliano. Collaborative dialogue patterns in naturalistic one-to-one tutoring. Applied Cognitive Psychology, 9:495–522, 1995.

[Landauer and Dumais, 1997] T. K. Landauer and S. T. Dumais. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104:211–240, 1997.

[Landauer et al., 1997] T. K. Landauer, D. Laham, R. Rehder, and M. E. Schreiner. How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 412–417, Mahwah, NJ, 1997. Erlbaum.

[Putnam, 1987] R. T. Putnam. Structuring and adjusting content for students: A study of live and simulated tutoring of addition. American Educational Research Journal, 24(1):13–48, 1987.

[Rehder et al., 1998] B. Rehder, M. Schreiner, D. Laham, M. Wolfe, T. Landauer, and W. Kintsch. Using Latent Semantic Analysis to assess knowledge: Some technical considerations. Discourse Processes, 25:337–354, 1998.

[Wiemer-Hastings et al., 1998] P. Wiemer-Hastings, A. Graesser, D. Harter, and the Tutoring Research Group. The foundations and architecture of AutoTutor. In B. Goettl, H. Halff, C. Redfield, and V. Shute, editors, Intelligent Tutoring Systems, Proceedings of the 4th International Conference, pages 334–343, Berlin, 1998. Springer.

[Wiemer-Hastings et al., 1999] P. Wiemer-Hastings, K. Wiemer-Hastings, and A. Graesser. Improving an intelligent tutor’s comprehension of students with Latent Semantic Analysis. In Proceedings of Artificial Intelligence in Education, 1999, Amsterdam, 1999. IOS Press.

[Wolfe et al., 1998] M. Wolfe, M. E. Schreiner, B. Rehder, D. Laham, P. W. Foltz, W. Kintsch, and T. K. Landauer. Learning from text: Matching readers and texts by Latent Semantic Analysis. Discourse Processes, 25:309–336, 1998.