Abstract

Latent Semantic Analysis (LSA) is a statistical, corpus-based text comparison mechanism that was originally developed for the task of information retrieval, but in recent years has produced remarkably human-like abilities in a variety of language tasks. LSA has taken the Test of English as a Foreign Language and performed as well as non-native English speakers who were successful college applicants. It has shown an ability to learn words at a rate similar to humans. It has even graded papers as reliably as human graders. We have used LSA as a mechanism for evaluating the quality of student responses in an intelligent tutoring system, and its performance equals that of human raters with intermediate domain knowledge. It has been claimed that LSA’s text-comparison abilities stem primarily from its use of a statistical technique called singular value decomposition (SVD), which compresses a large amount of term and document co-occurrence information into a smaller space. This compression is said to capture the semantic information that is latent in the corpus itself. We test this claim by comparing LSA to a version of LSA without SVD, as well as to a simple keyword matching mechanism.

1 Introduction

In the late 1980’s, a group at Bellcore doing research on information retrieval techniques developed a statistical, corpus-based method for retrieving texts. Unlike the simple techniques which rely on weighted matches of keywords in the texts and queries, their method, called Latent Semantic Analysis (LSA), created a high-dimensional, spatial representation of a corpus and allowed texts to be compared geometrically. In the last few years, several researchers have applied this technique to a variety of tasks including the synonym section of the Test of English as a Foreign Language [Landauer et al., 1997], general lexical acquisition from text [Landauer and Dumais, 1997], selecting texts for students to read [Wolfe et al., 1998], judging the coherence of student essays [Foltz et al., 1998], and the evaluation of student contributions in an intelligent tutoring environment [Wiemer-Hastings et al., 1998; 1999]. In all of these tasks, the reliability of LSA’s judgments is remarkably similar to that of humans.
Although classical Natural Language Processing techniques have begun to produce acceptable performance on real world texts, as shown in the Message Understanding Conferences [DARPA, 1995], they still require huge amounts of painstaking knowledge engineering and are fairly brittle in the face of unexpected input. Recently, corpus-based statistical techniques have been developed in the areas of word-tagging and syntactic grammar inference. But these techniques are not aimed at providing a representation of the meaning of a text.

The specific source of LSA’s discriminative power is not exactly clear. A significant part of its processing is a type of principal components analysis called singular value decomposition (SVD), which compresses a large amount of co-occurrence information into a much smaller space. This compression step is somewhat similar to the common feature of neural network systems where a large number of inputs is connected to a fairly small number of hidden layer nodes. If there are too many nodes, a network will “memorize” the training set, miss the generalities in the data, and consequently perform poorly on a test set. The basic input to LSA is a large amount of text (on the order of magnitude of a book). The corpus is turned into a co-occurrence matrix of terms by “documents”, where for our purposes, a document is a paragraph. SVD computes an approximation of this data structure of an arbitrary rank K. Common values of K are between 200 and 500, and are thus considerably smaller than the usual number of terms or documents in a corpus, which are on the order of 10000. It has been claimed that this compression step captures regularities in the patterns of co-occurrence across terms and across documents, and furthermore, that these regularities are related to the semantic structure of the terms and documents.

*This work was supported by grant number SBR 9720314 from the National Science Foundation’s Learning and Intelligent Systems program.
In this paper, we examine this claim by comparing several approaches which assess the quality of student contributions in an intelligent tutoring situation. We use human judgments of quality as a baseline, and compare them to three different models: the full LSA model, a version of LSA without SVD, and a simple keyword-matching mechanism. The paper starts with a description of the quality judgment task, and describes how LSA was used to rate the contributions. In section 3, we describe the implementation of LSA without SVD, and compare it to the SVD results. In section 4, we compare these to a basic keyword matching algorithm which used both a weighted and an unweighted matching technique. We close with a discussion of these results.
2 Evaluating student contribution quality with LSA

To provide a baseline description against which the alternative methods can be judged, this section describes the rating task for both the humans and LSA, gives some technical details of the LSA implementation, and describes how it performed in relation to the human raters.

As the litmus test for the various evaluation techniques, we have chosen the domain of an intelligent tutoring system called AutoTutor that was developed with the goal of simulating natural human-human dialogue [Wiemer-Hastings et al., 1998]. The central data structure for AutoTutor was a curriculum script [Putnam, 1987] that contained 12 questions in each of three different topics: computer hardware, operating systems, and the internet. For each question in the curriculum script, there was a variety of information about expected student answers and possible follow-up dialogue moves. The questions were designed to be deep reasoning questions for which a complete answer would cover several aspects. AutoTutor’s curriculum script contained an expected good answer for each of the aspects of a question, as well as a prompt, hint, and elaboration that could potentially elicit that answer. The use of these dialogue moves was based on studies of human tutors [Graesser et al., 1995]. AutoTutor chose which move to use based on the student’s ability and on which expected good answers were already covered. LSA was the primary mechanism for determining that coverage, based on comparisons between the student responses and the expected good answers. When a particular contribution achieved a cosine match above an empirically determined threshold, that aspect of the question was considered as covered for the purposes of the tutoring task. This approach led to the definition of the basic evaluation measure:

    Compatibility = Matches / Propositions

where Propositions is the number of speech acts in the student contribution, and Matches is the number of Propositions that achieved an above-threshold LSA cosine with one of the expected good answers. Loosely speaking, this is the percentage of the student’s contribution that sufficiently matched the expected answers.
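To make the measure concrete, the following minimal Python sketch computes Compatibility for one student response. The similarity function and the threshold value stand in for the LSA comparison and the empirically determined cutoff described here; they are assumptions for illustration, not the actual AutoTutor code.

    def compatibility(propositions, good_answers, similarity, threshold):
        # Fraction of the student's speech acts (propositions) whose best
        # match against any expected good answer reaches the threshold.
        if not propositions:
            return 0.0
        matches = sum(
            1 for p in propositions
            if any(similarity(p, a) >= threshold for a in good_answers)
        )
        return matches / len(propositions)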
The test set for this task was based on eight questions from each of the three tutoring topics. Students in several sections of a university-level computer literacy course were given extra credit for typing in answers to the questions in a word processing document. They were encouraged to write complete, thorough answers to the questions. Eight substantive (i.e. not “I don’t know”) answers were randomly selected for each of the 24 questions, constituting a test set of 192 items.

To assess the depth of knowledge that LSA uses, human raters with different levels of experience with the subject matter were used. Two raters, a graduate student and a research fellow, were computer scientists with high levels of knowledge of the computer literacy domain. Two additional raters, a graduate student and a professor in Psychology, had intermediate-level knowledge. They were familiar with all of the text materials from the computer literacy domain that were used in the project.

The human raters were asked to break the student responses into propositions, i.e. parts that could stand on their own, and then to judge on a six-point scale the percentage of each student’s propositions that “matched” part of the ideal answer. They were not instructed as to what should constitute a match. The correlation between the two expert raters was r=0.78. Between the intermediate knowledge raters, the correlation was r=0.52. The correlation between the average expert rating and the average intermediate rating was r=0.76. All of the correlations were significant.

We briefly describe the LSA mechanism here in order to demonstrate the difference between it and the other approaches. Further technical details about LSA can be found in [Deerwester et al., 1990; Landauer and Dumais, 1997] and several of the articles in the 1998 special issue of Discourse Processes on quantitative approaches to semantic knowledge representation.
As mentioned above, the basic input to LSA is a large corpus of text. The computer literacy corpus consisted of two complete computer literacy textbooks, ten articles on each of the tutoring topics, and the entire curriculum script (including the expected good answers). Each curriculum script item counted as a separate document, and the rest of the corpus was separated into paragraphs, because paragraphs tend to describe a single complex concept. The entire corpus was approximately 2.3 MB of text. LSA defines a term as a word (separated by whitespace or punctuation) that occurs in at least two documents. There is also a list of about 400 very frequent words (“the”, “and”, and “for”, for example) that are not used as terms. As previously mentioned, LSA creates from this corpus a large co-occurrence matrix of documents by terms, in which each cell is the number of times that that term occurred in that document. Each cell is then multiplied by a log entropy weight which essentially reduces the effect of words which occur across a wide variety of contexts (more about this later). SVD then creates a K-dimensional approximation of this matrix consisting of three matrices: a D by K documents matrix, a K by T terms matrix, and a K by K singular values (or eigenvalues) matrix (D is the number of documents, and T is the number of terms). Multiplying these matrices together results in an approximation to the original matrix. Each column of the terms matrix can be viewed as a K-long vector representing the “meaning” of that term. Each row of the documents matrix can be seen as a K-long vector representing the meaning of that document. Furthermore, each document vector equals the sum of the vectors of the terms in that document.
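The pipeline just described can be sketched in a few lines of numpy. The paper does not spell out its exact log entropy formula, so the standard log-entropy weighting used below is an assumption, as are the function names.

    import numpy as np

    def log_entropy_weight(counts):
        # counts: D x T matrix of raw term frequencies per document.
        # Local log weight times a global entropy weight that shrinks
        # terms spread evenly across many documents (assumed standard
        # log-entropy form; the paper does not give its exact formula).
        n_docs = counts.shape[0]
        p = counts / np.maximum(counts.sum(axis=0), 1.0)
        with np.errstate(divide="ignore", invalid="ignore"):
            plogp = np.where(p > 0, p * np.log(p), 0.0)
        global_w = 1.0 + plogp.sum(axis=0) / np.log(n_docs)
        return np.log(counts + 1.0) * global_w

    def lsa_space(counts, k=200):
        # Rank-k SVD of the weighted matrix. Rows of `docs` are the
        # K-long document vectors; columns of `terms` are the K-long
        # term vectors described in the text.
        x = log_entropy_weight(counts)
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        docs, terms = u[:, :k] * s[:k], vt[:k, :]
        return docs, terms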
[Figure 1: The correlation between LSA quality judgments and those of human raters.]

The LSA mechanism for AutoTutor works by calculating the vectors for the student contributions and comparing them to the document vectors for the expected good answers using the cosine metric. Empirical analyses of the corpus size, the number of dimensions, and the thresholds showed that the LSA mechanism performed best with the entire corpus described above, and with 200 dimensions in the LSA space. Figure 1 shows the correlations between the LSA ratings and the average of the human ratings over a variety of cosine match thresholds. The correlation between LSA and the humans approaches that between the human raters. Although a slightly higher correlation was achieved with a 400-dimension LSA space, this increased performance was limited to a single threshold level. This was interpreted as a potential outlier, and the 200 dimension space, with its relatively flat performance across thresholds, was preferred.
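This comparison step can be sketched as follows: a new text’s vector is the sum of the vectors of its known terms (as noted above, a document vector equals the sum of its term vectors), and two texts are compared by the cosine between their vectors. The whitespace tokenization and the vocabulary dictionary here are simplifying assumptions.

    import numpy as np

    def text_vector(text, term_index, terms):
        # Sum the K-long vectors of the known terms in the text;
        # `terms` is the K x T matrix from the sketch above.
        cols = [term_index[w] for w in text.lower().split() if w in term_index]
        if not cols:
            return np.zeros(terms.shape[0])
        return terms[:, cols].sum(axis=1)

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom else 0.0

    # A student contribution covers an aspect of the question when the
    # cosine between its vector and the vector of the corresponding
    # expected good answer exceeds the empirically determined threshold.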
3 LSA without SVD

As previously mentioned, LSA has several attributes that may be responsible for its ability to make effective similarity judgments on texts. In addition to the compression/generalization provided by the SVD calculation, LSA might get its benefits from its initial representation of word “meaning” as a vector of the documents that it occurs in. Before the SVD processing, this representation is modified by an information theoretic weighting of the elements, which gives higher weights to terms that appear distinctively in a smaller number of texts, and lower weights to terms that occur frequently across texts. The comparison of texts using the cosine measure on such vectors might also be responsible for such good performance. To test how much discriminative power LSA gains from SVD, we implemented a version of LSA without SVD. This section describes the implementation and evaluation of this mechanism, and relates it to the evaluation of the standard LSA approach.

To create this model, we started with the documents by terms co-occurrence matrix after the information theoretic weighting and before the SVD processing. We took the columns of this matrix as a representation of the meaning of each term. Because there were over 8000 documents in the corpus and most terms occur in a small number of documents, this is a very sparse representation. Still, it is possible to compare these vectors using the cosine metric. Two terms which occur in exactly the same set of documents would have a cosine of 1.0. Terms which occur in disjoint sets of documents have a cosine of 0. It is also possible with this representation to compute a document vector by adding the vectors of the terms in the document. However, it is not possible to construe the rows in the co-occurrence matrix as the vectors representing document meaning, because they have a different rank (the number of terms in the corpus) and because there is no reason to equate a pattern of term occurrence (the terms are alphabetized in the representation) with a pattern of document occurrence. Thus, we had to calculate vectors not just for the student contributions but for the expected good answer documents as well.
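In this variant, a term’s vector is simply its weighted column of the documents-by-terms matrix, so text vectors are D-dimensional rather than K-dimensional. A minimal sketch under the same assumptions as before:

    import numpy as np

    def text_vector_no_svd(text, term_index, weighted_counts):
        # Without SVD, a term's "meaning" is its weighted column of the
        # D x T matrix, so a text's vector (the sum of its term vectors)
        # is D-dimensional and very sparse. As noted above, the expected
        # good answers are folded in this same way rather than read off
        # the matrix rows.
        cols = [term_index[w] for w in text.lower().split() if w in term_index]
        if not cols:
            return np.zeros(weighted_counts.shape[0])
        return weighted_counts[:, cols].sum(axis=1)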
[Figure 2: The correlation between LSA without SVD and human raters.]

After these vectors were computed, the evaluation was done in exactly the same way as the evaluation of the standard LSA mechanism. Figure 2 shows the correlations between the average of the humans’ ratings and the non-SVD model. It is clear that the combination of the distributed, weighted vectors and the geometrical comparisons was sufficient to produce judgments approaching those of the full LSA model. The maximum performance here is r = 0.43. As a reminder, the maximum performance of the full LSA model was r = 0.48. The maximum performance in this case, however, occurs at just one threshold. For the 200-dimension LSA model, there was fairly stable performance across several thresholds.
4 Keyword matching

Because the performance of the non-SVD algorithm was so close to that of the full LSA implementation, we decided to evaluate a simple keyword-based approach for this task. This section describes the implementation and evaluation of this approach.

To compare texts with a keyword-matching approach, we used the same segmentation of the student contribution, the same set of expected good answers for each question, and the same set of terms (as keywords) as in the other approaches. We used the same Compatibility measure (Matches / Propositions) that we used for LSA. To determine the extent to which a student contribution speech act S matched an expected good answer E, we defined the keyword match, KM, as follows:

    KM(S, E) = ( Σ_{t ∈ S ∩ E} w_t ) / max(|S|, |E|)

The variable w_t is the weight for a particular term. We tested this keyword approach using both a 1.0 weight for all terms, and also using the information theoretic weights calculated by LSA. The keyword match is essentially the sum of the weights for each keyword that occurs in both the student contribution and the expected good answer, divided by the maximum number of keywords in these two texts. As in the other evaluations, we correlated the performance of the metric at a range of different threshold levels, as described in the next section.
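A direct transcription of KM into Python might look like the following; representing the texts as keyword sets and defaulting missing weights to 1.0 (which also yields the unweighted variant) are our assumptions.

    def keyword_match(student_terms, answer_terms, weights=None):
        # KM: sum of the weights of the keywords shared by the two texts,
        # divided by the larger keyword count of the two texts. Passing
        # the LSA log-entropy weights gives the weighted variant.
        s, e = set(student_terms), set(answer_terms)
        if not s or not e:
            return 0.0
        shared = sum((weights or {}).get(t, 1.0) for t in s & e)
        return shared / max(len(s), len(e))

With KM substituted for the LSA cosine, the same Compatibility and thresholding procedure from section 2 applies unchanged.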
In our first evaluation of the keyword model, we used the same set of thresholds as in the non-SVD evaluation, namely from 0.95 down to 0.05 in 0.05 increments. This resulted in somewhat of a floor effect in the testing, however. The LSA weights for terms varied from about 0.3 to 1, but the highest values were only for very rare terms. Thus, most KM values for the weighted approach were relatively low, reaching a maximum of around 0.35, so we also ran the analysis on a set of thresholds from 0.38 down to 0.02 in 0.02 increments.

Figure 3 shows the correlations with the human ratings for the unweighted keyword model, and both threshold sets for the weighted model. The threshold labels in the figure do not correspond to the actual thresholds for the 0.38 to 0.02 threshold set. The actual thresholds, however, are not important. The general shape of the curve is a fairly clear indicator of the behavior of the model.

[Figure 3: Performance of the keyword matching technique.]

The most striking feature of this experiment is the peak correlation of r=0.47 shown by the weighted model at the 0.08 threshold level. This is almost equivalent to the maximum performance of the full LSA model. Similar to the 400-dimension LSA model and the no-SVD model described earlier, however, this point appears to be an outlier that would be unlikely to apply across another test set, because it is significantly higher than the neighboring thresholds, which display a fairly flat curve.

We are comfortable in claiming that the simple keyword model can achieve a reliable correlation of r = 0.40 with the human raters, with the weighted model showing a relatively flat contour across a range of thresholds. This level of performance is quite close to that shown by the LSA without SVD model, and within about 20% of the performance of the full LSA model. Given the very small amount of computational resources required to calculate the keyword approach (the terms and their weights are simply accessed in a hash table), such an approach to text comparison could be beneficial when computational resources are more important than getting the most reliable judgments.
Although the computation of the keyword match was fairly simple, it must be noted that the information theoretic weights used in the weighted keyword model came from the two-textbook corpus that was used for LSA. Collecting this amount of text was a daunting task, but alternative term weights could be calculated from a smaller corpus or from an online lexicographic tool like WordNet [Fellbaum, 1998].

5 Discussion

LSA gets its power from a variety of sources: the corpus-based representation of words, the information theoretic weighting, the use of the cosine to calculate distances between texts, and also SVD. SVD should make LSA more robustly able to derive text meaning when synonyms or other similar words are used. This may be reflected by the wider range of thresholds over which LSA performance remains relatively high.

Even though LSA without SVD seems to perform fairly well,¹ it must be noted that the use of SVD results in a very large space and processing time advantage by drastically reducing the size of the representation space. If we took LSA without SVD as the original basis for comparison, and then discovered the advantages of SVD with its ability to “do more with less”, it would clearly be judged superior to the non-SVD LSA model.

¹Similar results of a relatively small effect of SVD on a different corpus were reported by Guy Denhière.

It should also be noted that this task is rather difficult for LSA. It has been previously shown that LSA does better when it has more text to work with [Rehder et al., 1998], with relatively low discriminative abilities in the 2–60 word range, and steadily climbing performance for more than 60 words. In fact, other researchers have reported that in short-answer situations, LSA acts rather like a keyword matching mechanism; it is only with longer texts that LSA really distinguishes itself (Walter Kintsch, personal communication, January, 1999). Because the student texts in this study are relatively short (average length = 18 words), LSA had less information on which to base its judgments, and therefore its abilities to discriminate were reduced. It is possible that with longer texts there would be more of a difference between the performance of LSA and the alternative methods presented here. On the other hand, we must also point out that this lack of text seems to have hurt the human raters’ abilities to discriminate as well, resulting in fairly low inter-rater reliability scores.
The results presented here do not mitigate the promise of such corpus-based, statistical mechanisms as LSA. They suggest, however, that more research is needed to further tease apart the strengths of the various aspects of such an approach. In future research, we will remove the information theoretic weighting from the non-SVD model to determine how well the system can perform with the raw co-occurrence data alone.

6 Conclusions

In this paper we addressed the question of the contribution of the compression step of SVD to LSA, and we compared LSA to a simple keyword-based mechanism in evaluating the quality of student responses in a tutoring task. We showed that although the performance of the full LSA model was superior to the reduced models, these alternatives approached the discriminative power of the full model.
In conclusion, if you want a text evaluation mechanism based on comparisons, and if you have a good set of texts as a basis of comparison, you have several options. A simple keyword match performs surprisingly well, and is relatively inexpensive computationally. A mechanism like the no-SVD model presented here does not produce better maximum performance than the keyword model on these relatively short texts, but it does produce good performance across a range of thresholds, indicating a robustness in handling a variety of inputs. The full LSA model exceeds both the performance and the robustness of both of these models, achieving results comparable to those of humans with intermediate domain knowledge. Because the initial goal of the AutoTutor project is to simulate a normal human tutor who has no special training but nevertheless produces significant learning gains, we are happy with this level of performance. In future research, we will address the possibility of combining structural analysis of the student texts with LSA’s semantic capabilities. This may hold the key to approaching the performance of expert human raters in this task.
Acknowledgments

This work was completed with the help of Katja Wiemer-Hastings, Art Graesser, Roger Kreuz, Lee McCaulley, Bianca Klettke, Tim Brogdon, Melissa Ring, Ashraf Anwar, Myles Bogner, Fergus Nolan, and the other members of the Tutoring Research Group at the University of Memphis: Patrick Chipman, Scotty Craig, Rachel DiPaolo, Stan Franklin, Max Garzon, Barry Gholson, Doug Hacker, Xiangen Hu, Derek Harter, Jim Hoeffner, Jeff Janover, Kristen Link, Johanna Marineau, Bill Marks, Michael Muellenmeister, Brent Olde, Natalie Person, Victoria Pomeroy, Holly Yetman, and Zhaohua Zhang. We also wish to acknowledge very helpful comments on a previous draft by three anonymous reviewers.

References

[DARPA, 1995] DARPA. Proceedings of the Sixth Message Understanding Conference (MUC-6). Morgan Kaufman Publishers, San Francisco, 1995.

[Deerwester et al., 1990] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.

[Fellbaum, 1998] C. Fellbaum. WordNet: An electronic lexical database. MIT Press, Cambridge, MA, 1998.

[Foltz et al., 1998] P. W. Foltz, W. Kintsch, and T. K. Landauer. The measurement of textual coherence with latent semantic analysis. Discourse Processes, 25:285–307, 1998.

[Graesser et al., 1995] A. C. Graesser, N. K. Person, and J. P. Magliano. Collaborative dialogue patterns in naturalistic one-to-one tutoring. Applied Cognitive Psychology, 9:495–522, 1995.

[Landauer and Dumais, 1997] T. K. Landauer and S. T. Dumais. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104:211–240, 1997.

[Landauer et al., 1997] T. K. Landauer, D. Laham, R. Rehder, and M. E. Schreiner. How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 412–417, Mahwah, NJ, 1997. Erlbaum.

[Putnam, 1987] R. T. Putnam. Structuring and adjusting content for students: A study of live and simulated tutoring of addition. American Educational Research Journal, 24(1):13–48, 1987.

[Rehder et al., 1998] B. Rehder, M. Schreiner, D. Laham, M. Wolfe, T. Landauer, and W. Kintsch. Using Latent Semantic Analysis to assess knowledge: Some technical considerations. Discourse Processes, 25:337–354, 1998.

[Wiemer-Hastings et al., 1998] P. Wiemer-Hastings, A. Graesser, D. Harter, and the Tutoring Research Group. The foundations and architecture of AutoTutor. In B. Goettl, H. Halff, C. Redfield, and V. Shute, editors, Intelligent Tutoring Systems, Proceedings of the 4th International Conference, pages 334–343, Berlin, 1998. Springer.

[Wiemer-Hastings et al., 1999] P. Wiemer-Hastings, K. Wiemer-Hastings, and A. Graesser. Improving an intelligent tutor’s comprehension of students with Latent Semantic Analysis. In Proceedings of Artificial Intelligence in Education, 1999, Amsterdam, 1999. IOS Press.

[Wolfe et al., 1998] M. Wolfe, M. E. Schreiner, B. Rehder, D. Laham, P. W. Foltz, W. Kintsch, and T. K. Landauer. Learning from text: Matching readers and texts by Latent Semantic Analysis. Discourse Processes, 25:309–336, 1998.