
Text Categorization

Fabrizio Sebastiani
Istituto di Scienza e Tecnologie dell'Informazione
Consiglio Nazionale delle Ricerche
Via Giuseppe Moruzzi 1, 56124 Pisa, Italy
E-mail: [email protected]

Abstract
Text categorization (also known as text classification, or topic spotting) is the task of automatically sorting a set of documents into categories from a predefined set. This task has several applications, including automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, selective dissemination of information to information consumers, automated population of hierarchical catalogues of Web resources, spam filtering, identification of document genre, authorship attribution, survey coding, and even automated essay grading.
Automated text classification is attractive because it frees organizations from the need of manually organizing document bases, which can be too expensive, or simply infeasible given the time constraints of the application or the number of documents involved. The accuracy of modern text classification systems rivals that of trained human professionals, thanks to a combination of information retrieval (IR) technology and machine learning (ML) technology. This chapter will outline the fundamental traits of the technologies involved, of the applications that can feasibly be tackled through text classification, and of the tools and resources that are available to the researcher and developer wishing to take up these technologies for deploying real-world applications.
1 Introduction
Text categorization (TC – also known as text classification, or topic spotting) is the task of automatically sorting a set of documents into categories (or classes, or topics) from a predefined set. This task, which falls at the crossroads of information retrieval (IR) and machine learning (ML), has witnessed a booming interest in the last ten years from researchers and developers alike.
For IR researchers, this interest is one particular aspect of a general movement towards leveraging user data for taming the inherent subjectivity of the IR task¹, i.e. taming the fact that it is the user, and only the user, who can say whether a given item of information is relevant to a query she has issued to a Web search engine, or to a private folder of hers in which documents should be filed according to content. Wherever there are predefined classes, documents manually classified by the user are often available; as a consequence, this latter data can be exploited for automatically learning the (extensional) meaning that the user attributes to the classes, thereby reaching levels of classification accuracy that would be unthinkable if this data were unavailable.
For ML researchers, this interest is due to the fact that IR applications prove an excellent and challenging benchmark for their own techniques and methodologies, since IR applications usually feature extremely high-dimensional feature spaces (see Section 2.1) and provide data by the truckload. In the last five years, this has resulted in more and more ML researchers adopting TC as one of their benchmark applications of choice, which means that cutting-edge ML techniques are being imported into TC with minimal delay from their original invention.
For application developers, this interest is mainly due to the enormously increased need to handle larger and larger quantities of documents, a need emphasized by increased connectivity and availability of document bases of all types at all levels in the information chain. But this interest is also due to the fact that TC techniques have reached accuracy levels that rival the performance of trained professionals, and these accuracy levels can be achieved with high levels of efficiency on standard hardware and software resources. This means that more and more organizations are automating all their activities that can be cast as TC tasks.
This chapter thus purports to take a closer look at TC, by describing the standard methodology through which a TC system (henceforth: classifier) is built, and by reviewing techniques, applications, tools, and resources pertaining to research and development in TC.
¹ This movement spans several IR tasks, including text mining, document filtering and routing, text clustering, text summarization, and information extraction, plus other tasks in which the basic technologies from these latter are used, including question answering and topic detection and tracking. See e.g. the recent editions of the ACM SIGIR conference for representative examples of research in these fields.
The structure of this chapter is as follows. In Section 2 we will give a basic picture of how an automated TC system is built and tested. This will involve a discussion of the technology (mostly borrowed from IR) needed for building the internal representations of the documents (Section 2.1), of the technology (borrowed from ML) for automatically building a classifier from a "training set" of preclassified documents (Section 2.2), and of the methodologies for evaluating the quality of the classifiers one has built (Section 2.3). Section 3 will instead discuss some actual technologies for performing all this, concentrating on representative, state-of-the-art examples of them. In Section 4 we will discuss the main domains to which TC is applied nowadays. Section 5 concludes, discussing possible avenues of further research and development.
2 The basic picture
TC may be formalized as the task of approximating the unknown target function Φ : D × C → {T, F} (which describes how documents ought to be classified, according to a supposedly authoritative expert) by means of a function Φ̂ : D × C → {T, F} called the classifier, where C = {c1, . . . , c|C|} is a predefined set of categories and D is a (possibly infinite) set of documents.
If Φ(dj, ci) = T, then dj is called a positive example (or a member) of ci, while if Φ(dj, ci) = F it is called a negative example of ci.
The categories are just symbolic labels, and no additional knowledge (of a procedural or declarative nature) of their meaning is usually available; it is often the case that no metadata (such as e.g. publication date, document type, publication source) is available either. In these cases, classification must be accomplished only on the basis of knowledge extracted from the documents themselves. Since this case is the most general, it is the usual focus of TC research, and will also constitute the focus of this chapter². However, when in a given application either external knowledge or metadata is available, heuristic techniques of any nature may be adopted in order to leverage these data, either in combination with or in isolation from the IR and ML techniques we will discuss here.
TC is a subjective task: when two experts (human or artificial) decide whether or not to classify document dj under category ci, they may disagree, and this in fact happens with relatively high frequency. A news article on George W. Bush selling his shares of the Texas Rangers baseball team could be filed under Politics, or under Finance, or under Sport, or under any combination of the three, or even under neither, depending on the subjective judgment of the expert. Because of this, the meaning of a category is subjective, and the ML techniques described in Section 2.2, rather than attempting to produce a "gold standard" of dubious existence, aim to reproduce this very subjectivity by examining its manifestations, i.e. the documents that the expert herself has manually classified under C. The kind of learning that these ML techniques engage in is usually called supervised learning, as it is supervised, or facilitated, by the knowledge of the preclassified data.

² A further reason why TC research rarely tackles the case of additionally available external knowledge is that these sources of knowledge may vary widely in type and format, thereby making each such application a case in its own right, from which any lesson learned can hardly be exported to different application contexts.
Depending on the application, TC may be either a single-label task (i.e. exactly one ci ∈ C must be assigned to each dj ∈ D), or a multi-label task (i.e. any number 0 ≤ nj ≤ |C| of categories may be assigned to a document dj ∈ D)³. A special case of single-label TC is binary TC, in which, given a category ci, each dj ∈ D must be assigned either to ci or to its complement c̄i. A binary classifier for ci is then a function Φ̂i : D → {T, F} that approximates the unknown target function Φi : D → {T, F}.
A problem of multi-label TC under C = {c1, . . . , c|C|} is usually tackled as |C| independent binary classification problems under {ci, c̄i}, for i = 1, . . . , |C|. In this case, a classifier for C is thus actually composed of |C| binary classifiers.
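To make the decomposition concrete, here is a minimal sketch (in Python) of a multi-label classifier over C assembled from |C| independently trained binary classifiers; the train_binary learner and the data layout are hypothetical placeholders for illustration, not something prescribed by the chapter.

```python
# Minimal sketch: multi-label TC as |C| independent binary problems
# (one classifier per category). `train_binary` is a hypothetical learner
# that returns a predicate mapping a document to True/False.

def train_multilabel(train_docs, train_labels, categories, train_binary):
    """train_labels[j] is the set of categories assigned to train_docs[j]."""
    classifiers = {}
    for c in categories:
        # Documents labeled with c are positive examples; all others negative.
        y = [c in labels for labels in train_labels]
        classifiers[c] = train_binary(train_docs, y)
    return classifiers

def classify_multilabel(classifiers, doc):
    # A document may receive any number of categories, from 0 up to |C|.
    return {c for c, phi in classifiers.items() if phi(doc)}
```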
From the ML standpoint, learning a binary classifier (and hence a multi-label classifier) is usually a simpler problem than learning a single-label classifier. As a consequence, while all classes of supervised ML techniques (among which are the ones discussed in Section 2.2) have dealt with the binary classification problem since their very invention, for some classes of techniques (e.g. support vector machines – see Section 2.2) a satisfactory solution of the single-label problem is still the object of active investigation [1]. In this chapter, unless otherwise noted, we will always implicitly refer to the binary case.
Aside from actual operational use, which we will not discuss, we can roughly distinguish three different phases in the life cycle of a TC system, which have traditionally been tackled in isolation from each other (i.e. a solution to one problem is not influenced by the solutions given to the other two): document indexing, classifier learning, and classifier evaluation. The three following paragraphs are devoted to these three phases, respectively; for a more detailed treatment see Sections 5, 6 and 7, respectively, of [2].
2.1 Document indexing
Document indexing denotes the activity of mapping a document dj into a compact representation of its content that can be directly interpreted (i) by a classifier-building algorithm and (ii) by a classifier, once it has been built.
The document indexing methods usually employed in TC are borrowed from IR, where a text dj is typically represented as a vector of term weights dj = ⟨w1j, . . . , w|T|j⟩. Here, T is the dictionary, i.e. the set of terms (also known as features) that occur at least once in at least k documents (in TC: in at least k training documents), and 0 ≤ wkj ≤ 1 quantifies the importance of tk in characterizing the semantics of dj. Typical values of k are between 1 and 5.

³ Somewhat confusingly, in the ML field the single-label case is dubbed the multiclass case.
An indexing method is characterized by (i) a definition of what a term is, and (ii) a method to compute term weights. Concerning (i), the most frequent choice is to identify terms either with the words occurring in the document (with the exception of stop words, i.e. topic-neutral words such as articles and prepositions, which are eliminated in a pre-processing phase), or with their stems (i.e. their morphological roots, obtained by applying a stemming algorithm [3]). A popular choice is to add to the set of words or stems a set of phrases, i.e. longer (and semantically more significant) language units extracted from the text by shallow parsing and/or statistical techniques [4]. Concerning (ii), term weights may be binary-valued (i.e. wkj ∈ {0, 1}) or real-valued (i.e. 0 ≤ wkj ≤ 1), depending on whether the classifier-building algorithm and the classifiers, once they have been built, require binary input or not. When weights are binary, these simply indicate presence/absence of the term in the document. When weights are non-binary, they are computed by either statistical or probabilistic techniques (see e.g. [5]), the former being the most common option. One popular class of statistical term weighting functions is tf∗idf (see e.g. [6]), where two intuitions are at play: (a) the more frequently tk occurs in dj, the more important for dj it is (the term frequency intuition); (b) the more documents tk occurs in, the less discriminating it is, i.e. the smaller its contribution is in characterizing the semantics of a document in which it occurs (the inverse document frequency intuition). Weights computed by tf∗idf techniques are often normalized so as to contrast the tendency of tf∗idf to emphasize long documents.
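To make the weighting scheme concrete, below is a minimal sketch of one common tf∗idf variant (raw term counts, logarithmic idf, cosine normalization). The exact formula varies across the literature, so treat this as one plausible instantiation rather than a canonical definition.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Cosine-normalized tf*idf weights for a list of tokenized documents.

    Here tf = raw count of the term in the document and idf = log(N / df),
    where df is the number of documents containing the term; cosine
    normalization counteracts the tendency to emphasize long documents."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df_t) for t, df_t in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        w = {t: tf_t * idf[t] for t, tf_t in tf.items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})
    return vectors

docs = [["markets", "fell", "sharply"], ["markets", "rallied"], ["rain", "fell"]]
print(tfidf_vectors(docs)[0])
```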
In TC, unlike in IR, a dimensionality reduction phase is often applied so as to reduce the size of the document representations from |T| to a much smaller, predefined number. This has both the effect of reducing overfitting (i.e. the tendency of the classifier to better classify the data it has been trained on than new unseen data), and of making the problem more manageable for the learning method, since many such methods are known not to scale well to high problem sizes. Dimensionality reduction often takes the form of feature selection: each term is scored by means of a scoring function that captures its degree of (positive, and sometimes also negative) correlation with ci, and only the highest scoring terms are used for document representation. Alternatively, dimensionality reduction may take the form of feature extraction: a set of "artificial" terms is generated from the original term set in such a way that the newly generated terms are both fewer and stochastically more independent from each other than the original ones used to be.
2.2 Classifier learning
A text classifier for ci is automatically generated by a general inductive process (the learner) which, by observing the characteristics of a set of documents preclassified under ci or c̄i, gleans the characteristics that a new unseen document should have in order to belong to ci. In order to build classifiers for C, one thus needs a set Ω of documents such that the value of Φ(dj, ci) is known for every ⟨dj, ci⟩ ∈ Ω × C. In experimental TC it is customary to partition Ω into three disjoint sets Tr (the training set), Va (the validation set), and Te (the test set). The training set is the set of documents by observing which the learner builds the classifier. The validation set is the set of documents on which the engineer fine-tunes the classifier, e.g. choosing, for a parameter p on which the classifier depends, the value that has yielded the best effectiveness when evaluated on Va. The test set is the set on which the effectiveness of the classifier is finally evaluated. In both the validation and test phases, "evaluating the effectiveness" means running the classifier on a set of preclassified documents (Va or Te) and checking the degree of correspondence between the output of the classifier and the preassigned classes.
Different learners have been applied in the TC literature. Some of these methods generate binary-valued classifiers of the required form Φ̂ : D × C → {T, F}, but some others generate real-valued functions of the form CSV : D × C → [0, 1] (CSV standing for categorization status value). For these latter, a set of thresholds τi needs to be determined (typically, by experimentation on a validation set) allowing one to turn real-valued CSVs into the final binary decisions [7].
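A minimal sketch of this thresholding step follows: given the CSV scores of validation documents for category ci, it picks the threshold τi that maximizes F1 (defined in Section 2.3). The grid of candidate thresholds and the function names are illustrative assumptions.

```python
def tune_threshold(csv_scores, true_labels):
    """Pick the tau that maximizes F1 on a validation set.

    csv_scores: real-valued CSV(dj, ci) outputs for the validation documents;
    true_labels: booleans saying whether each document truly belongs to ci."""
    best_tau, best_f1 = 0.5, -1.0
    for tau in (t / 100 for t in range(101)):    # candidate thresholds in [0, 1]
        decisions = [s >= tau for s in csv_scores]
        tp = sum(d and y for d, y in zip(decisions, true_labels))
        fp = sum(d and not y for d, y in zip(decisions, true_labels))
        fn = sum(not d and y for d, y in zip(decisions, true_labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau
```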
It is worthwhile to notice that in several applications, the fact that a method implements a real-valued function can be profitably used, in which case determining thresholds is not needed. For instance, in applications in which the quality of the classification is of critical importance (e.g. in filing patents into patent directories), post-editing of the classifier output by a human professional is often necessary. In this case, having the documents ranked in terms of their estimated relevance to the category may be useful, since the human editor can scan the ranked list starting from the documents deemed most appropriate for the category, and stop when desired.
2.3 Classifier evaluation
Training efficiency (i.e. the average time required to build a classifier Φ̂i from a given corpus Ω), classification efficiency (i.e. the average time required to classify a document by means of Φ̂i), and effectiveness (i.e. the average correctness of Φ̂i's classification behaviour) are all legitimate measures of success. In TC research, effectiveness is usually considered the most important criterion, since it is the most reliable one when it comes to experimentally comparing different learners or different TC methodologies, given that efficiency depends on too volatile parameters (e.g. different software/hardware platforms). In TC applications, however, all three parameters are important, and one must carefully look for a tradeoff among them, depending on the application constraints. For instance, in applications involving interaction with the user, a classifier with low classification efficiency is unsuitable. On the contrary, in multi-label TC applications involving thousands of categories, a classifier with low training efficiency might also be inappropriate (since many thousands of classifiers need to be learnt). Anyway, effectiveness tends to be the primary criterion in operational contexts too, since in most applications an ineffective though efficient classifier will be hardly useful, or will involve too much post-editing work on the part of human professionals, which might defeat the purpose of using an automated system.
In single-label TC, effectiveness is usually measured by accuracy, i.e. the percentage of correct classification decisions (error is the converse of accuracy, i.e. E = 1 − A). However, in binary (and hence in multi-label) TC, accuracy is not an adequate measure. The reason for this is that in binary TC applications the two categories ci and c̄i are usually unbalanced, i.e. one contains far more members than the other⁴. In this case, building a classifier that has high accuracy is trivial, since the trivial rejector, i.e. the classifier that trivially assigns all documents to the most heavily populated category (i.e. c̄i), has indeed very high accuracy; and there are no applications in which one is interested in such a classifier⁵. As a result, in binary TC it is often the case that effectiveness wrt category ci is measured by a combination of precision wrt ci (πi), the percentage of documents deemed to belong to ci that in fact belong to it, and recall wrt ci (ρi), the percentage of documents belonging to ci that are in fact deemed to belong to it.
In multi-label TC, when effectiveness is computed for several categories, the precision and recall results for individual categories must be averaged in some way; here, one may opt for microaveraging ("categories count proportionally to the number of their positive training examples") or for macroaveraging ("all categories count the same"), depending on the application desiderata (see Table 1). The former rewards classifiers that behave well on heavily populated ("frequent") categories, while the latter emphasizes classifiers that perform well also on infrequent categories. In TC research, macroaveraging is often the method of choice, since producing classifiers that perform well also on infrequent categories is the most challenging problem of TC.

⁴ For example, the number of Web pages that should be filed under the category NuclearWasteDisposal is orders of magnitude smaller than the number of pages that should not.

⁵ One further consequence of adopting accuracy as the effectiveness measure when classes are unbalanced is that, in the phase of parameter tuning on a validation set (see Section 2.2), there will be a tendency to choose parameter values that make the classifier behave very much like the trivial rejector.

Table 1: Averaging precision and recall across different categories; TPi, TNi, FPi and FNi refer to the sets of true positives, true negatives, false positives, and false negatives wrt ci, respectively:

    Microaveraging:  π = Σi |TPi| / Σi (|TPi| + |FPi|)     ρ = Σi |TPi| / Σi (|TPi| + |FNi|)
    Macroaveraging:  π = (Σi πi) / |C|                     ρ = (Σi ρi) / |C|

where the sums run over i = 1, . . . , |C|.
Since most classifiers can be arbitrarily tuned to emphasize recall at the expense of precision (and vice versa), only combinations of the two are significant. The most popular way to combine the two is the function Fβ = (β² + 1)πρ / (β²π + ρ), for some value 0 ≤ β ≤ ∞; usually, β is taken to be equal to 1, which means that the function becomes F1 = 2πρ / (π + ρ), i.e. the harmonic mean of precision and recall. Note that for the trivial rejector, π = 1 and ρ = 0, so Fβ = 0 for any value of β (symmetrically, the trivial acceptor has ρ = 1 but, for the unbalanced categories typical of TC, a π close to 0, and hence an F1 close to 0).
Finally, it should be noted that some applications of TC require cost-based issues to be brought to bear on how effectiveness is computed, thus inducing a utility-theoretic notion of effectiveness. For instance, in spam filtering (i.e. a binary TC task in which e-mail messages must be classified under the category Spam or its complement NonSpam), precision is more important than recall, since filing a legitimate message under Spam is a more serious error (i.e. it bears more cost) than filing a junk message under NonSpam. One possible way of taking this into account is using the Fβ measure with β ≠ 1: using values 0 ≤ β < 1 corresponds to paying more attention to precision than to recall, while using values 1 < β ≤ ∞ emphasizes recall at the expense of precision.
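Putting the definitions of this section together, here is a minimal sketch computing microaveraged and macroaveraged precision and recall from per-category contingency counts, plus the Fβ combination; the input format and function names are illustrative assumptions.

```python
def f_beta(p, r, beta=1.0):
    # F_beta = (beta^2 + 1) * p * r / (beta^2 * p + r); beta < 1 favours
    # precision, beta > 1 favours recall, beta = 1 is the harmonic mean F1.
    denom = beta ** 2 * p + r
    return (beta ** 2 + 1) * p * r / denom if denom else 0.0

def micro_macro(counts):
    """counts: list of (TP_i, FP_i, FN_i) tuples, one per category c_i."""
    tp, fp, fn = (sum(c[k] for c in counts) for k in range(3))
    micro_p = tp / (tp + fp) if tp + fp else 0.0
    micro_r = tp / (tp + fn) if tp + fn else 0.0
    p_i = [t / (t + f) if t + f else 0.0 for t, f, _ in counts]
    r_i = [t / (t + f) if t + f else 0.0 for t, _, f in counts]
    return (micro_p, micro_r), (sum(p_i) / len(counts), sum(r_i) / len(counts))

micro, macro = micro_macro([(90, 10, 5), (2, 1, 7)])
print(micro, macro, f_beta(*macro, beta=0.5))
```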
3 Techniques
We now discuss some of the actual techniques for dealing with the problems of document indexing and classifier learning discussed in the previous section. Presenting a complete review of them is outside the scope of this chapter; as a consequence, we will only hint at the various choices that are available to the designer, and will enter into some detail only for a few of them.

3.1 Document indexing techniques
The TC community has not displayed much creativity in devising document weighting techniques specific to TC. In fact, most of the works reported in the TC literature so far use the standard document weighting techniques, either of a statistical or of a probabilistic nature, that are used in all other subfields of IR, including text search (e.g. tf∗idf or BM25 – see [5]). The only exception to this that we know of is [8], where the idf component in tf∗idf is replaced by a function learnt from training data, and aimed at assessing how good a term is at discriminating categories from each other.
Also in TC, as in other subfields of IR, the use of larger indexing units, such as frequently adjacent word pairs (aka "bigrams") or syntactically determined phrases, has not shown systematic patterns of improvement [4, 9], which means that terms are usually made to coincide with single words, stemmed or not.
Dimensionality reduction is tackled either by feature selection techniques, such as mutual information (aka information gain) [10], chi-square [11], or gain ratio [8], or by feature extraction techniques, such as latent semantic indexing [12, 13] or term clustering [9]. Recent work on term extraction methods has focused on methods specific to TC (or rather: specific to problems in which training data exist), i.e. on supervised term clustering techniques [14, 15, 16], which have shown better performance than the previously mentioned unsupervised techniques.
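As an example of feature selection by term scoring, the sketch below computes the standard chi-square statistic of a term from its 2×2 contingency table with a category, and keeps the k highest-scoring terms; the input format is an assumption made for illustration.

```python
def chi_square(tp, fp, fn, tn):
    """Chi-square score of term t for category c_i, from the contingency table:
    tp = docs in c_i containing t,     fp = docs outside c_i containing t,
    fn = docs in c_i without t,        tn = docs outside c_i without t."""
    n = tp + fp + fn + tn
    num = n * (tp * tn - fp * fn) ** 2
    den = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    return num / den if den else 0.0

def select_features(term_tables, k):
    """term_tables maps term -> (tp, fp, fn, tn); keep the k best terms."""
    ranked = sorted(term_tables, key=lambda t: chi_square(*term_tables[t]),
                    reverse=True)
    return ranked[:k]
```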
3.2 Classifier learning techniques
The number of classes of classifier learning techniques that have been used in TC is bewildering. These include at the very least probabilistic methods, regression methods, decision tree and decision rule learners, neural networks, batch and incremental learners of linear classifiers, example-based methods, support vector machines, genetic algorithms, hidden Markov models, and classifier committees (which include boosting methods). Rather than attempting to say even a few words about each of them, we will introduce in some detail two of them, namely support vector machines and boosting. The reasons for this choice are twofold. First, these are the two methods that have unquestionably shown the best performance in comparative TC experiments performed so far. Second, these are the newest methods in the classifier learning arena, and the ones with the strongest justifications from computational learning theory.
3.2.1 Support vector machines
The support vector machine (SVM) method has been introduced in TC by Joachims [17, 18] and subsequently used in several other TC works [19, 20, 21]. In geometrical terms, it may be seen as the attempt to find, among all the surfaces σ1, σ2, . . . in |T|-dimensional space that separate the positive from the negative training examples (decision surfaces), the σi that separates the positives from the negatives by the widest possible margin, i.e. such that the minimal distance between the hyperplane and a training example is maximum; results in computational learning theory indicate that this tends to minimize the generalization error, i.e. the error of the resulting classifier on yet unseen examples. SVMs were originally conceived for binary classification problems [22], and only recently have they been adapted to multiclass classification [1].
As argued by Joachims [17], one advantage that SVMs offer for TC is that dimensionality reduction is usually not needed, as SVMs tend to be fairly robust to overfitting and can scale up to considerable dimensionalities. Recent extensive experiments by Brank and colleagues [23] also indicate that feature selection tends to be detrimental to the performance of SVMs.
Recently, efficient algorithms for SVM learning have also been discovered; as a consequence, the use of SVMs for high-dimensional problems such as TC is no longer prohibitive from the point of view of computational cost.
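The chapter points to dedicated SVM packages below; purely as a modern stand-in, the following sketch trains a linear SVM on tf∗idf vectors with the scikit-learn library (an assumption of this illustration, not something the chapter itself uses), on a toy binary category. In line with the observations above, no feature selection is applied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training set for one binary category ("earnings" vs. its complement).
train_texts = ["quarterly profits rose sharply", "net income beat forecasts",
               "the team won the championship", "heavy rain flooded the city"]
train_labels = [1, 1, 0, 0]            # 1 = positive example of the category

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.predict(["profits and income grew"]))    # expected: [1]
```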
There are currently several freely available packages for SVM learning. The best known in the binary TC camp is the SVMlight package⁶, while one that has been extended to also deal with the general single-label classification problem is BSVM⁷.

⁶ SVMlight is available from http://svmlight.joachims.org/
⁷ BSVM is available from http://www.csie.ntu.edu.tw/~cjlin/bsvm/

3.2.2 Boosting
Classifier committees (aka ensembles) are based on the idea that k different classifiers Φ1, . . . , Φk may be better than one if their individual judgments are appropriately combined. In the boosting method [24, 25, 26, 27] the k classifiers Φ1, . . . , Φk are obtained by the same learning method (here called the weak learner), and are trained not in a conceptually parallel and independent way, but sequentially. In this way, in training classifier Φt one may take into account how classifiers Φ1, . . . , Φt−1 perform on the training examples, and concentrate on getting right those examples on which Φ1, . . . , Φt−1 have performed worst.
Specifically, for learning classifier Φt each ⟨dj, ci⟩ pair is given an "importance weight" ht (where h1 is set to be equal for all pairs), which represents how hard it was for classifiers Φ1, . . . , Φt−1 to take a correct decision on this pair. These weights are exploited in learning Φt, which will be specially tuned to correctly solve the pairs with higher weight. Classifier Φt is then applied to the training documents, and as a result the weights ht are updated to ht+1; in this update operation, pairs correctly classified by Φt have their weight decreased, while pairs misclassified by Φt have their weight increased. After all the k classifiers have been built, a weighted linear combination rule is applied to yield the final committee.
Boosting has proven a powerful intuition, and the BoosTexter system⁸ has reached one of the highest levels of effectiveness reported in the literature so far.
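For concreteness, here is a minimal AdaBoost-style sketch in which the weak learner is a one-word decision stump; it illustrates the weight-update scheme described above, but it is not the BoosTexter algorithm itself (which uses confidence-rated predictions), and all names are illustrative.

```python
import math

def train_stump(docs, y, weights, vocab):
    """Weak learner: the single word (and polarity) with lowest weighted error.
    docs are sets of words; y[j] is +1 or -1."""
    best = (None, 1, float("inf"))               # (word, polarity, error)
    for w in vocab:
        for pol in (1, -1):
            err = sum(h for doc, yj, h in zip(docs, y, weights)
                      if (pol if w in doc else -pol) != yj)
            if err < best[2]:
                best = (w, pol, err)
    return best

def adaboost(docs, y, rounds=10):
    n = len(docs)
    vocab = set().union(*docs)
    weights = [1 / n] * n              # h_1: equal importance for every example
    committee = []
    for _ in range(rounds):
        w, pol, err = train_stump(docs, y, weights, vocab)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        committee.append((w, pol, alpha))
        # Misclassified pairs get their weight increased, correct ones decreased.
        weights = [h * math.exp(-alpha * yj * (pol if w in doc else -pol))
                   for doc, yj, h in zip(docs, y, weights)]
        z = sum(weights)
        weights = [h / z for h in weights]
    def classify(doc):                 # weighted linear combination rule
        score = sum(a * (pol if w in doc else -pol) for w, pol, a in committee)
        return 1 if score >= 0 else -1
    return classify

spam = adaboost([{"cheap", "pills"}, {"money", "fast"},
                 {"meeting", "agenda"}, {"project", "report"}],
                [1, 1, -1, -1], rounds=5)
print(spam({"cheap", "money"}))        # expected: 1
```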
4 Applications
As mentioned in Section 1, the applications of TC are manifold. Common traits among all of them are:

• The need to handle and organize documents in which the textual component is either the unique, or the dominant, or the simplest to interpret, component.

• The need to handle and organize large quantities of such documents, i.e. large enough that their manual organization into classes is either too expensive or not feasible within the time constraints imposed by the application.

• The fact that the set of categories is known in advance, and its variability over time is small⁹.

Applications may instead vary along several dimensions:

• The nature of the documents; i.e. documents may be structured texts (such as e.g. scientific articles), newswire stories, classified ads, image captions, e-mail messages, transcripts of spoken texts, hypertexts, or other. If the documents are hypertextual, rather than textual, very different techniques may be used, since links provide a rich source of information on which classifier learning activity can leverage. Techniques exploiting this intuition in a TC context have been presented in [28, 29, 30, 31] and experimentally compared in [32].

• The structure of the classification scheme, i.e. whether this is flat or hierarchical. Hierarchical classification schemes may in turn be tree-shaped, or allow for multiple inheritance (i.e. be DAG-shaped). Again, the hierarchical structure of the classification scheme may allow radically more efficient, and also more effective, classification algorithms, which can take advantage of early subtree pruning [33, 21, 34] (see the sketch after this list), improved selection of negative examples [35], or improved estimation of word occurrence statistics in leaf nodes [36, 37, 38, 39].

• The nature of the task, i.e. whether the task is single-label or multi-label.

⁸ BoosTexter is available from http://www.cs.princeton.edu/~schapire/boostexter.html

⁹ In practical applications, the set of categories does change from time to time. For instance, in indexing computer science scientific articles under the ACM classification scheme, one needs to consider that this scheme is revised every five to ten years, to reflect changes in the CS discipline. This means that training documents need to be created for newly introduced categories, and that training documents may have to be removed for categories whose meaning has evolved.
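As promised in the list above, here is a minimal sketch of top-down classification over a tree-shaped scheme with early subtree pruning: a document descends from the root, and the subtrees rooted at rejected nodes are never visited. The Node class and the per-node binary classifiers are illustrative assumptions, not a specific algorithm from the cited works.

```python
# Minimal sketch of top-down hierarchical classification with early
# subtree pruning; node.classifier is a hypothetical binary classifier
# trained for that node's category.

class Node:
    def __init__(self, category, classifier, children=()):
        self.category = category
        self.classifier = classifier      # doc -> bool
        self.children = list(children)

def classify_hierarchical(root, doc):
    assigned, frontier = [], [root]
    while frontier:
        node = frontier.pop()
        if node.classifier(doc):          # accepted: record and descend
            assigned.append(node.category)
            frontier.extend(node.children)
        # rejected: the whole subtree below this node is pruned
    return assigned
```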
Hereafter, we briefly review some important applications of TC. Note that the borders between the different classes of applications listed here are fuzzy, and some of these may be considered special cases of others.
4.1 Automatic indexing for Boolean information retrieval systems
The application that has stimulated research in TC from its very beginning, back in the '60s, until the '80s, is that of automatic indexing of scientific articles by means of a controlled dictionary, such as the ACM Classification Scheme, where the categories are the entries of the controlled dictionary. This is typically a multi-label task, since several index terms are usually assigned to each document.
Automatic indexing with controlled dictionaries is closely related to the automated metadata generation task. In digital libraries one is usually interested in tagging documents by metadata that describe them under a variety of aspects (e.g. creation date, document type or format, availability, etc.). Some of these metadata are thematic, i.e. their role is to describe the semantics of the document by means of bibliographic codes, keywords or keyphrases. The generation of these metadata may thus be viewed as a problem of document indexing with a controlled dictionary, and thus tackled by means of TC techniques. In the case of Web documents, metadata describing them will be needed for the Semantic Web to become a reality, and TC techniques applied to Web data may be envisaged as contributing part of the solution to the huge problem of generating the metadata needed by Semantic Web resources.
4.2 Document organization
Indexing with a controlled vocabulary is an instance of the general problem of document base organization. In general, many other issues pertaining to document organization and filing, be it for purposes of personal organization or structuring of a corporate document base, may be addressed by TC techniques. For instance, at the offices of a newspaper it might be necessary to classify all past articles in order to ease future retrieval in the case of new events related to the ones described by the past articles. Possible categories might be HomeNews, International, Money, Lifestyles, Fashion, but also finer-grained ones such as ThePittAnistonMarriage.
Another possible application in the same range is the organization of patents into categories for making later access easier, and of patent applications for allowing patent officers to discover possible prior work on the same topic [40]. This application, like all applications having to do with patent data, introduces specific problems, since the description of the allegedly novel technique, which is written by the patent applicant, may intentionally use non-standard vocabulary in order to create the impression that the technique is indeed novel. This use of non-standard vocabulary may depress the performance of a text classifier, since the assumption that underlies practically all TC work is that training documents and test documents are drawn from the same word distribution.
4.3 Text filtering
Text filtering is the activity of classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer. Typical cases of filtering systems are e-mail filters [41] (in which case the producer is actually a multiplicity of producers), newsfeed filters [42], or filters of unsuitable content [43]. A filtering system should block the delivery of the documents the consumer is likely not interested in. Filtering is a case of binary TC, since it involves the classification of incoming documents into two disjoint categories, the relevant and the irrelevant. Additionally, a filtering system may also further classify the documents deemed relevant to the consumer into thematic categories of interest to the user. A filtering system may be installed at the producer end, in which case it must route the documents to the interested consumers only, or at the consumer end, in which case it must block the delivery of documents deemed uninteresting to the consumer.

In information science, document filtering has a tradition dating back to the '60s, when, addressed by systems of various degrees of automation and dealing with the multi-consumer case discussed above, it was called selective dissemination of information or current awareness. The explosion in the availability of digital information has boosted the importance of such systems, which are nowadays being used in diverse contexts such as the creation of personalized Web newspapers, junk e-mail blocking, and Usenet news selection.
4.4 Hierarchical categorization of Web pages
TC has recently aroused a lot of interest also for its possible application to automatically classifying Web pages, or sites, under the hierarchical catalogues hosted by popular Internet portals. When Web documents are catalogued in this way, rather than issuing a query to a general-purpose Web search engine, a searcher may find it easier to first navigate in the hierarchy of categories and then restrict her search to a particular category of interest. Classifying Web pages automatically has obvious advantages, since the manual categorization of a large enough subset of the Web is infeasible. With respect to previously discussed TC applications, automatic Web page categorization has two essential peculiarities (both discussed in Section 4), namely the hypertextual nature of the documents, and the typically hierarchical structure of the category set.
4.5 Word sense disambiguation
Word sense disambiguation (WSD) is the activity of finding, given the occurrence in a text of an ambiguous (i.e. polysemous or homonymous) word, the sense of this particular word occurrence. For instance, bank may have (at least) two different senses in English, as in the Bank of England (a financial institution) or the bank of the river Thames (a hydraulic engineering artifact). It is thus a WSD task to decide which of the above senses the occurrence of bank in "Last week I borrowed some money from the bank" has. WSD may be seen as a (single-label) TC task (see e.g. [44]) once, given a word w, we view the contexts of occurrence of w as documents and the senses of w as categories.
4.6 Automated survey coding
Survey coding is the task of assigning a symbolic code from a predefined set of such codes to the answer that a person has given in response to an open-ended question in a questionnaire (aka survey). This task is usually carried out in order to group respondents according to a predefined scheme based on their answers. Survey coding has several applications, especially in the social sciences, where the classification of respondents is functional to the extraction of statistics on political opinions, health and lifestyle habits, customer satisfaction, brand fidelity, and patient satisfaction.

Survey coding is a difficult task, since the code that should be attributed to a respondent based on the answer she has given is a matter of subjective judgment, and thus requires expertise. The problem can be formulated as a single-label TC problem [45], where the answers play the role of the documents, and the codes that are applicable to the answers returned to a given question play the role of the categories (different questions thus correspond to different TC problems).
4.7 Automated authorship attribution and genre classification
Authorship attribution is the task of determining the author of a text of disputed or unknown paternity, choosing from a predefined set of candidate authors [46, 47, 48]. Authorship attribution has several applications, ranging from the literary (e.g. discovering who the author of a recently discovered sonnet is) to the forensic (e.g. identifying the sender of an anonymous letter, or checking the authenticity of a letter allegedly authored by a given person). Authorship attribution can also be seen as a single-label TC task, with possible authors playing the role of the categories. This is an application in which a TC system typically cannot be taken at face value; usually, its result contributes an "opinion" on who the possible author might be, but the final decision has to be taken by a human professional. As a result, a TC system that ranks the candidate authors in terms of their probability of being the true author would be useful (see Section 2.2).
The intuitions that must be brought to bear in these applications are orthogonal to those that are at play in topic-based classification, since an author normally writes about multiple topics. Because of this, it is unlikely that topic-based features can be good at discriminating among authors. Rather, stylistic features are the most appropriate choice; for instance, vocabulary richness (i.e. the ratio between the number of distinct words and the total number of words), average word length, and average sentence length are important, in the sense that it is these features that tend "to give an author away".
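By way of illustration, the sketch below extracts exactly these three stylistic features from a raw text; the tokenization rules are deliberately simplistic assumptions.

```python
import re

def stylistic_features(text):
    """Topic-neutral features often used in authorship attribution."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not words or not sentences:
        return {}
    return {
        "vocabulary_richness": len({w.lower() for w in words}) / len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
    }

print(stylistic_features("Call me Ishmael. Some years ago I went to sea."))
```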
Genre classification is also an applicative context which bears remarkable similarities to authorship attribution. There are applicative contexts in which it is desirable to classify documents by genre, rather than by topic [49, 50, 51, 52]. For instance, it might be desirable to classify articles about scientific subjects into one of the two categories PopularScience and HardScience, in order to decide whether they are suitable for publication in popular science magazines or not; likewise, distinguishing between ProductReviews and Advertisements might be useful for several applications. In genre classification too, topic-dependent words are not good separating features, and specialized features need to be devised, which are often similar to the ones used for authorship attribution applications.
4.8 Spam filtering
Filtering spam (i.e. unsolicited bulk e-mail) is a task of increasing applicative interest that lies at the crossroads between filtering and genre classification. In fact, it has the dynamical character of other filtering applications, such as e-mail filtering, and it cuts across different topics, as genre classification does. Several attempts, some of them quite successful, have been made at applying standard text classification techniques to spam filtering, for applications involving either personal mail [53, 19, 54] or mailing lists [55]. However, operational spam filters must rely not only on standard ML techniques, but also on manually selected features. In fact, similarly to the case of genre classification or authorship attribution, it is the stylistic (i.e. topic-neutral) features that are important, rather than the topic-based ones. Indeed, spam deals with a multiplicity of topics (from miraculous money-making schemes to Viagra pills), and cues indicative of topics can hardly be effective unless they are supplemented with other topic-neutral ones. On the contrary, a human eye may immediately recognize a spam message from visual cues, such as e.g. the amount of all-caps words in the subject line or in the text of the message, the number of exclamation marks in the subject line, an unknown sender with an unknown Web e-mail address (e.g. [email protected]), or even the peculiar formatting of the message body. Representing these visual cues (as well as taking into account other standard phrases such as "Make money fast!") as features is important to the effectiveness of an operational spam filter.

One further problem that makes spam filtering challenging is the frequent unavailability of negative training messages. A software maker wishing to customize its spam filter for a particular client needs training examples; while positive ones (i.e. spam messages) are not hard to collect in large quantities, negative ones (i.e. legitimate messages) are difficult to find because of privacy issues, since a company dealing with industrially sensitive data will not disclose samples of its own incoming legitimate messages even to someone who is going to use these messages for improving a service to them. In this case, ML methods that can do without negative examples need to be used.
4.9 Other applications
The applications described above are just the major ones among those TC has been used for. Here, we only briefly hint at a few others.

Myers and colleagues [56], and Schapire and Singer [25], have attacked speech categorization by means of a combination of speech recognition and TC, in the context of a phone call routing application. Sable and Hatzivassiloglou instead classify images through the classification of their textual captions [57]. Larkey [58] uses TC to tackle automated essay grading, where the different grades that can be attributed to an essay play the role of categories. In a question answering application, Li and Roth [59] classify questions according to question type; this allows a question answering system to focus on locating the right type of information for the right type of question, thus improving the effectiveness of the overall system.
5 Conclusion
Text categorization has evolved, from the neglected research niche it used to be until the late '80s, into a fully blossomed research field which has delivered efficient, effective, and overall workable solutions that have been used in tackling a wide variety of real-world application domains. Key to this success have been (i) the ever-increasing involvement of the machine learning community in text categorization, which has lately resulted in the use of the very latest machine learning technology within text categorization applications, and (ii) the availability of standard benchmarks (such as Reuters-21578 and OHSUMED), which has encouraged research by providing a setting in which different research efforts could be compared to each other, and in which the best methods and algorithms could stand out.
Currently, text categorization research is pointing in several interesting directions. One of them is the attempt at finding better representations for text; while the bag-of-words model is still the unsurpassed text representation model, researchers have not renounced the belief that a text must be something more than a mere collection of tokens, and that the quest for models more sophisticated than the bag-of-words model is still worth pursuing [60].
A further direction is investigating the scalability properties of text classification systems, i.e. understanding whether the systems that have proved the best in terms of effectiveness alone stand up to the challenge of dealing with very large numbers of categories (e.g. in the tens of thousands) [61].
Last but not least are the attempts at solving the labeling bottleneck, i.e. at coming to terms with the fact that labeling examples for training a text classifier, when labeled examples do not previously exist, is expensive. As a result, there is increasing attention in text categorization to semi-supervised machine learning methods, i.e. methods that can bootstrap off a small set of labeled examples and also leverage unlabeled examples [62].
References
[1] Crammer, K. & Singer, Y., On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, pp. 265–292, 2001.
[2] Sebastiani, F., Machine learning in automated text categorization. ACM Computing Surveys, 34(1), pp. 1–47, 2002.
[3] Frakes, W.B., Stemming algorithms. Information Retrieval: Data Structures and Algorithms, eds. W.B. Frakes & R. Baeza-Yates, Prentice Hall: Englewood Cliffs, US, pp. 131–160, 1992.
[4] Caropreso, M.F., Matwin, S. & Sebastiani, F., A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. Text Databases and Document Management: Theory and Practice, ed. A.G. Chin, Idea Group Publishing: Hershey, US, pp. 78–102, 2001.
[5] Zobel, J. & Moffat, A., Exploring the similarity space. SIGIR Forum, 32(1), pp. 18–34, 1998.
[6] Salton, G. & Buckley, C., Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), pp. 513–523, 1988. Also reprinted in [63], pp. 323–328.
[7] Yang, Y., A study on thresholding strategies for text categorization. Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, eds. W.B. Croft, D.J. Harper, D.H. Kraft & J. Zobel, ACM Press, New York, US: New Orleans, US, pp. 137–145, 2001.
[8] Debole, F. & Sebastiani, F., Supervised term weighting for automated text categorization. Proceedings of SAC-03, 18th ACM Symposium on Applied Computing, ACM Press, New York, US: Melbourne, US, pp. 784–788, 2003.
[9] Lewis, D.D., An evaluation of phrasal and clustered representations on a text categorization task. Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval, eds. N.J. Belkin, P. Ingwersen & A.M. Pejtersen, ACM Press, New York, US: Kobenhavn, DK, pp. 37–50, 1992.
[10] Lewis, D.D. & Ringuette, M., A comparison of two learning algorithms for text categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, US, pp. 81–93, 1994.
[11] Yang, Y. & Pedersen, J.O., A comparative study on feature selection in text categorization. Proceedings of ICML-97, 14th International Conference on Machine Learning, ed. D.H. Fisher, Morgan Kaufmann Publishers, San Francisco, US: Nashville, US, pp. 412–420, 1997.
[12] Wiener, E.D., Pedersen, J.O. & Weigend, A.S., A neural network approach to topic spotting. Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, US, pp. 317–332, 1995.
[13] Schütze, H., Hull, D.A. & Pedersen, J.O., A comparison of classifiers and document representations for the routing problem. Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval, eds. E.A. Fox, P. Ingwersen & R. Fidel, ACM Press, New York, US: Seattle, US, pp. 229–237, 1995.
[14] Baker, L.D. & McCallum, A.K., Distributional clustering of words for text classification. Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, eds. W.B. Croft, A. Moffat, C.J.V. Rijsbergen, R. Wilkinson & J. Zobel, ACM Press, New York, US: Melbourne, AU, pp. 96–103, 1998.
[15] Bekkerman, R., El-Yaniv, R., Tishby, N. & Winter, Y., On feature distributional clustering for text categorization. Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, eds. W.B. Croft, D.J. Harper, D.H. Kraft & J. Zobel, ACM Press, New York, US: New Orleans, US, pp. 146–153, 2001.
[16] Slonim, N. & Tishby, N., The power of word clusters for text classification. Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research, Darmstadt, DE, 2001.
[17] Joachims, T., Text categorization with support vector machines: learning with many relevant features. Proceedings of ECML-98, 10th European Conference on Machine Learning, eds. C. Nédellec & C. Rouveirol, Springer Verlag, Heidelberg, DE: Chemnitz, DE, pp. 137–142, 1998. Published in the "Lecture Notes in Computer Science" series, number 1398.
[18] Joachims, T., Transductive inference for text classification using support vector machines. Proceedings of ICML-99, 16th International Conference on Machine Learning, eds. I. Bratko & S. Dzeroski, Morgan Kaufmann Publishers, San Francisco, US: Bled, SL, pp. 200–209, 1999.
[19] Drucker, H., Vapnik, V. & Wu, D., Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5), 1999.
[20] Dumais, S.T., Platt, J., Heckerman, D. & Sahami, M., Inductive learning algorithms and representations for text categorization. Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management, eds. G. Gardarin, J.C. French, N. Pissinou, K. Makki & L. Bouganim, ACM Press, New York, US: Bethesda, US, pp. 148–155, 1998.
[21] Dumais, S.T. & Chen, H., Hierarchical classification of Web content. Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval, eds. N.J. Belkin, P. Ingwersen & M.K. Leong, ACM Press, New York, US: Athens, GR, pp. 256–263, 2000.
[22] Vapnik, V.N., The nature of statistical learning theory. Springer Verlag: New York, US, 1995.
[23] Brank, J., Grobelnik, M., Milić-Frayling, N. & Mladenić, D., Interaction of feature selection methods and linear classification models. Proceedings of the ICML-02 Workshop on Text Learning, Sydney, AU, 2002.
[24] Schapire, R.E., Singer, Y. & Singhal, A., Boosting and Rocchio applied to text filtering. Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, eds. W.B. Croft, A. Moffat, C.J.V. Rijsbergen, R. Wilkinson & J. Zobel, ACM Press, New York, US: Melbourne, AU, pp. 215–223, 1998.
[25] Schapire, R.E. & Singer, Y., BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3), pp. 135–168, 2000.
[26] Sebastiani, F., Sperduti, A. & Valdambrini, N., An improved boosting algorithm and its application to automated text categorization. Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management, eds. A. Agah, J. Callan & E. Rundensteiner, ACM Press, New York, US: McLean, US, pp. 78–85, 2000.
[27] Nardiello, P., Sebastiani, F. & Sperduti, A., Discretizing continuous attributes in AdaBoost for text categorization. Proceedings of ECIR-03, 25th European Conference on Information Retrieval, ed. F. Sebastiani, Springer Verlag: Pisa, IT, pp. 320–334, 2003.
[28] Chakrabarti, S., Dom, B.E. & Indyk, P., Enhanced hypertext categorization using hyperlinks. Proceedings of SIGMOD-98, ACM International Conference on Management of Data, eds. L.M. Haas & A. Tiwary, ACM Press, New York, US: Seattle, US, pp. 307–318, 1998.
[29] Oh, H.J., Myaeng, S.H. & Lee, M.H., A practical hypertext categorization method using links and incrementally available class information. Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval, eds. N.J. Belkin, P. Ingwersen & M.K. Leong, ACM Press, New York, US: Athens, GR, pp. 264–271, 2000.
[30] Slattery, S. & Craven, M., Discovering test set regularities in relational domains. Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Morgan Kaufmann Publishers, San Francisco, US: Stanford, US, pp. 895–902, 2000.
[31] Getoor, L., Segal, E., Taskar, B. & Koller, D., Probabilistic models of text and link structure for hypertext classification. Proceedings of the IJCAI-01 Workshop on Text Learning: Beyond Supervision, Seattle, US, pp. 24–29, 2001.
[32] Yang, Y., Slattery, S. & Ghani, R., A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2/3), pp. 219–241, 2002. Special Issue on Automated Text Categorization.
[33] Chakrabarti, S., Dom, B.E., Agrawal, R. & Raghavan, P., Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. Journal of Very Large Data Bases, 7(3), pp. 163–178, 1998.
[34] Koller, D. & Sahami, M., Hierarchically classifying documents using very few words. Proceedings of ICML-97, 14th International Conference on Machine Learning, ed. D.H. Fisher, Morgan Kaufmann Publishers, San Francisco, US: Nashville, US, pp. 170–178, 1997.
[35] Ng, H.T., Goh, W.B. & Low, K.L., Feature selection, perceptron learning, and a usability case study for text categorization. Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval, eds. N.J. Belkin, A.D. Narasimhalu & P. Willett, ACM Press, New York, US: Philadelphia, US, pp. 67–73, 1997.
[36] Gaussier, É., Goutte, C., Popat, K. & Chen, F., A hierarchical model for clustering and categorising documents. Proceedings of ECIR-02, 24th European Colloquium on Information Retrieval Research, eds. F. Crestani, M. Girolami & C.J.V. Rijsbergen, Springer Verlag, Heidelberg, DE: Glasgow, UK, pp. 229–247, 2002. Published in the "Lecture Notes in Computer Science" series, number 2291.
[37] McCallum, A.K., Rosenfeld, R., Mitchell, T.M. & Ng, A.Y., Improving text classification by shrinkage in a hierarchy of classes. Proceedings of ICML-98, 15th International Conference on Machine Learning, ed. J.W. Shavlik, Morgan Kaufmann Publishers, San Francisco, US: Madison, US, pp. 359–367, 1998.
[38] Toutanova, K., Chen, F., Popat, K. & Hofmann, T., Text classification in a hierarchical mixture model for small training sets. Proceedings of CIKM-01, 10th ACM International Conference on Information and Knowledge Management, eds. H. Paques, L. Liu & D. Grossman, ACM Press, New York, US: Atlanta, US, pp. 105–113, 2001.
[39] Vinokourov, A. & Girolami, M., A probabilistic framework for the hierarchic organisation and classification of document collections. Journal of Intelligent Information Systems, 18(2/3), pp. 153–172, 2002. Special Issue on Automated Text Categorization.
[40] Larkey, L.S., A patent search and classification system. Proceedings of DL-99, 4th ACM Conference on Digital Libraries, eds. E.A. Fox & N. Rowe, ACM Press, New York, US: Berkeley, US, pp. 179–187, 1999.
[41] Weiss, S.M., Apté, C., Damerau, F.J., Johnson, D.E., Oles, F.J., Goetz, T. & Hampp, T., Maximizing text-mining performance. IEEE Intelligent Systems, 14(4), pp. 63–69, 1999.
[42] Amati, G., D'Aloisi, D., Giannini, V. & Ubaldini, F., A framework for filtering news and managing distributed data. Journal of Universal Computer Science, 3(8), pp. 1007–1021, 1997.
[43] Chandrinos, K.V., Androutsopoulos, I., Paliouras, G. & Spyropoulos, C.D., Automatic Web rating: Filtering obscene content on the Web. Proceedings of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries, eds. J.L. Borbinha & T. Baker, Springer Verlag, Heidelberg, DE: Lisbon, PT, pp. 403–406, 2000. Published in the "Lecture Notes in Computer Science" series, number 1923.
[44] Escudero, G., Màrquez, L. & Rigau, G., Boosting applied to word sense disambiguation. Proceedings of ECML-00, 11th European Conference on Machine Learning, eds. R. López de Mántaras & E. Plaza, Springer Verlag, Heidelberg, DE: Barcelona, ES, pp. 129–141, 2000. Published in the "Lecture Notes in Computer Science" series, number 1810.
[45] Giorgetti, D. & Sebastiani, F., Automating survey coding by multiclass text categorization techniques. Journal of the American Society for Information Science and Technology, 2003. Forthcoming.
[46] Vel, O.Y.D., Anderson, A., Corney, M. & Mohay, G.M., Mining email content for author identification forensics. SIGMOD Record, 30(4), pp. 55–64, 2001.
[47] Forsyth, R.S., New directions in text categorization. Causal Models and Intelligent Data Management, ed. A. Gammerman, Springer Verlag: Heidelberg, DE, pp. 151–185, 1999.
[48] Diederich, J., Kindermann, J., Leopold, E. & Paaß, G., Authorship attribution with support vector machines. Applied Intelligence, 19(1/2), pp. 109–123, 2003.
[49] Finn, A., Kushmerick, N. & Smyth, B., Genre classification and domain transfer for information filtering. Proceedings of ECIR-02, 24th European Colloquium on Information Retrieval Research, eds. F. Crestani, M. Girolami & C.J.V. Rijsbergen, Springer Verlag, Heidelberg, DE: Glasgow, UK, pp. 353–362, 2002. Published in the "Lecture Notes in Computer Science" series, number 2291.
[50] Kessler, B., Nunberg, G. & Schütze, H., Automatic detection of text genre. Proceedings of ACL-97, 35th Annual Meeting of the Association for Computational Linguistics, eds. P.R. Cohen & W. Wahlster, Morgan Kaufmann Publishers, San Francisco, US: Madrid, ES, pp. 32–38, 1997.
[51] Lee, Y.B. & Myaeng, S.H., Text genre classification with genre-revealing and subject-revealing features. Proceedings of SIGIR-02, 25th ACM International Conference on Research and Development in Information Retrieval, eds. M. Beaulieu, R. Baeza-Yates, S.H. Myaeng & K. Järvelin, ACM Press, New York, US: Tampere, FI, pp. 145–150, 2002.
[52] Stamatatos, E., Fakotakis, N. & Kokkinakis, G., Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4), pp. 471–495, 2000.
[53] Androutsopoulos, I., Koutsias, J., Chandrinos, K.V. & Spyropoulos, C.D., An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval, eds. N.J. Belkin, P. Ingwersen & M.K. Leong, ACM Press, New York, US: Athens, GR, pp. 160–167, 2000.
[54] Gómez-Hidalgo, J.M., Evaluating cost-sensitive unsolicited bulk email categorization. Proceedings of SAC-02, 17th ACM Symposium on Applied Computing, Madrid, ES, pp. 615–620, 2002.
[55] Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C.D. & Stamatopoulos, P., A memory-based approach to anti-spam filtering for mailing lists. Information Retrieval, 6(1), pp. 49–73, 2003.
[56] Myers, K., Kearns, M., Singh, S. & Walker, M.A., A boosting approach to topic spotting on subdialogues. Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Morgan Kaufmann Publishers, San Francisco, US: Stanford, US, pp. 655–662, 2000.
[57] Sable, C.L. & Hatzivassiloglou, V., Text-based approaches for non-topical image categorization. International Journal of Digital Libraries, 3(3), pp. 261–275, 2000.
[58] Larkey, L.S., Automatic essay grading using text categorization techniques. Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, eds. W.B. Croft, A. Moffat, C.J.V. Rijsbergen, R. Wilkinson & J. Zobel, ACM Press, New York, US: Melbourne, AU, pp. 90–95, 1998.
[59] Li, X. & Roth, D., Learning question classifiers. Proceedings of COLING-02, 19th International Conference on Computational Linguistics, Taipei, TW, 2002.
[60] Koster, C.H. & Seutter, M., Taming wild phrases. Proceedings of ECIR-03, 25th European Conference on Information Retrieval, ed. F. Sebastiani, Springer Verlag: Pisa, IT, pp. 161–176, 2003.
[61] Yang, Y., A scalability analysis of classifiers in text categorization. Proceedings of SIGIR-03, 26th ACM International Conference on Research and Development in Information Retrieval, ACM Press, New York, US: Toronto, CA, 2003.
[62] Nigam, K., McCallum, A.K., Thrun, S. & Mitchell, T.M., Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), pp. 103–134, 2000.
[63] Sparck Jones, K. & Willett, P., (eds.) Readings in information retrieval.
Morgan Kaufmann: San Mateo, US, 1997.
