
Text Categorization
Fabrizio Sebastiani
Istituto di Scienza e Tecnologie dell’Informazione
Consiglio Nazionale delle Ricerche
Via Giuseppe Moruzzi, 1
56124 Pisa, Italy
E-mail: [email protected]

Abstract
Text categorization (also known as text classification, or topic spotting) is the task of automatically sorting a set of documents into categories from a predefined set. This task has several applications, including automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, selective dissemination of information to information consumers, automated population of hierarchical catalogues of Web resources, spam filtering, identification of document genre, authorship attribution, survey coding, and even automated essay grading.
Automated text classification is attractive because it frees organizations from the need to organize document bases manually, which can be too expensive, or simply infeasible given the time constraints of the application or the number of documents involved. The accuracy of modern text classification systems rivals that of trained human professionals, thanks to a combination of information retrieval (IR) technology and machine learning (ML) technology. This chapter will outline the fundamental traits of the technologies involved, of the applications that can feasibly be tackled through text classification, and of the tools and resources that are available to the researcher and developer wishing to take up these technologies for deploying real-world applications.
1 Introduction
Text categorization (TC, also known as text classification or topic spotting) is the task of automatically sorting a set of documents into categories (or classes, or topics) from a predefined set. This task, which falls at the crossroads of information retrieval (IR) and machine learning (ML), has witnessed a booming interest in the last ten years from researchers and developers alike.
For IR researchers, this interest is one particular aspect of a general movement towards leveraging user data for taming the inherent subjectivity of the IR task¹, i.e. taming the fact that it is the user, and only the user, who can say whether a given item of information is relevant to a query she has issued to a Web search engine, or to a private folder of hers in which documents should be filed according to content. Wherever there are predefined classes, documents manually classified by the user are often available; as a consequence, these data can be exploited for automatically learning the (extensional) meaning that the user attributes to the classes, thereby reaching levels of classification accuracy that would be unthinkable if these data were unavailable.
For ML researchers, this interest is due to the fact that IR applications prove an excellent and challenging benchmark for their own techniques and methodologies, since IR applications usually feature extremely high-dimensional feature spaces (see Section 2.1) and provide data by the truckload. In the last five years, this has resulted in more and more ML researchers adopting TC as one of their benchmark applications of choice, which means that cutting-edge ML techniques are being imported into TC with minimal delay from their original invention.
For application developers, this interest is mainly due to the enormously increased need to handle larger and larger quantities of documents, a need emphasized by increased connectivity and availability of document bases of all types at all levels in the information chain. But this interest is also due to the fact that TC techniques have reached accuracy levels that rival the performance of trained professionals, and these accuracy levels can be achieved with high levels of efficiency on standard hardware and software resources. This means that more and more organizations are automating all their activities that can be cast as TC tasks.
This chapter thus takes a closer look at TC, by describing the standard methodology through which a TC system (henceforth: classifier) is built, and by reviewing techniques, applications, tools, and resources pertaining to research and development in TC.
¹ This movement spans several IR tasks including text mining, document filtering and routing, text clustering, text summarization, information extraction, plus other tasks in which the basic technologies from these latter are used, including question answering and topic detection and tracking. See e.g. the recent editions of the ACM SIGIR conference for representative examples of research in these fields.
The structure of this chapter is as follows. In Section 2 we will give a basic picture of how an automated TC system is built and tested. This will involve a discussion of the technology (mostly borrowed from IR) needed for building the internal representations of the documents (Section 2.1), of the technology (borrowed from ML) for automatically building a classifier from a “training set” of preclassified documents (Section 2.2), and of the methodologies for evaluating the quality of the classifiers one has built (Section 2.3). Section 3 will discuss some actual technologies for performing all this, concentrating on representative, state-of-the-art examples. In Section 4 we will instead discuss the main domains to which TC is applied nowadays. Section 5 concludes, discussing possible avenues of further research and development.
2 The basic picture
TC may be formalized as the task of approximating the unknown target function $\Phi : D \times C \to \{T, F\}$ (that describes how documents ought to be classified, according to a supposedly authoritative expert) by means of a function $\hat{\Phi} : D \times C \to \{T, F\}$ called the classifier, where $C = \{c_1, \ldots, c_{|C|}\}$ is a predefined set of categories and $D$ is a (possibly infinite) set of documents. If $\Phi(d_j, c_i) = T$, then $d_j$ is called a positive example (or a member) of $c_i$, while if $\Phi(d_j, c_i) = F$ it is called a negative example of $c_i$.
The categories are just symbolic labels, and no additional knowledge (of a procedural or declarative nature) of their meaning is usually available; it is often the case that no metadata (such as publication date, document type, or publication source) is available either. In these cases, classification must be accomplished only on the basis of knowledge extracted from the documents themselves. Since this case is the most general, it is the usual focus of TC research, and will also constitute the focus of this chapter². However, when in a given application either external knowledge or metadata is available, heuristic techniques of any nature may be adopted in order to leverage these data, either in combination with or in isolation from the IR and ML techniques we will discuss here.
TC is a subjective task: when two experts (human or artificial) decide whether or not to classify document $d_j$ under category $c_i$, they may disagree, and this in fact happens with relatively high frequency. A news article on George W. Bush selling his shares of the Texas Rangers baseball team could be filed under Politics, or under Finance, or under Sport, or under any combination of the three, or even under none of them, depending on the subjective judgment of the expert. Because of this, the meaning of a category is subjective, and the ML techniques described in Section 2.2, rather than attempting to produce a “gold standard” of dubious existence, aim to reproduce this very subjectivity by examining its manifestations, i.e. the documents that the expert herself has manually classified under $C$. The kind of learning that these ML techniques engage in is usually called supervised learning, as it is supervised, or facilitated, by the knowledge of the preclassified data.
² A further reason why TC research rarely tackles the case of additionally available external knowledge is that these sources of knowledge may vary widely in type and format, thereby making each such application a case of its own, from which any lesson learned can hardly be exported to different application contexts.
Depending on the application, TC may be either a single-label task (i.e. exactly one $c_i \in C$ must be assigned to each $d_j \in D$), or a multi-label task (i.e. any number $0 \le n_j \le |C|$ of categories may be assigned to a document $d_j \in D$)³. A special case of single-label TC is binary TC, in which, given a category $c_i$, each $d_j \in D$ must be assigned either to $c_i$ or to its complement $\bar{c}_i$. A binary classifier for $c_i$ is then a function $\hat{\Phi}_i : D \to \{T, F\}$ that approximates the unknown target function $\Phi_i : D \to \{T, F\}$.
A problem of multi-label TC under $C = \{c_1, \ldots, c_{|C|}\}$ is usually tackled as $|C|$ independent binary classification problems under $\{c_i, \bar{c}_i\}$, for $i = 1, \ldots, |C|$. In this case, a classifier for $C$ is thus actually composed of $|C|$ binary classifiers.
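The decomposition of a multi-label problem into independent binary problems can be sketched as follows. This is only an illustration: the function name `binarize` and the toy category names are assumptions introduced here, not part of the chapter.

```python
from typing import Dict, List, Set


def binarize(doc_labels: List[Set[str]], categories: List[str]) -> Dict[str, List[bool]]:
    """For each category c_i, build the target vector of the binary problem {c_i, not c_i}.

    doc_labels[j] is the (possibly empty) set of categories assigned to document d_j.
    """
    return {c: [c in labels for labels in doc_labels] for c in categories}


# Toy multi-label data (hypothetical): three documents, three categories.
doc_labels = [{"Politics", "Finance"}, {"Sport"}, set()]
targets = binarize(doc_labels, ["Politics", "Finance", "Sport"])
# Each entry of `targets` can now be handed to a separate binary learner.
```

Each of the resulting target vectors defines one binary problem, so any binary learner can be reused unchanged for the multi-label case.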
From the ML standpoint, learning a binary classifier (and hence a multi-label classifier) is usually a simpler problem than learning a single-label classifier. As a consequence, while all classes of supervised ML techniques (among which the ones discussed in Section 2.2) have dealt with the binary classification problem since their very invention, for some classes of techniques (e.g. support vector machines, see Section 2.2) a satisfactory solution of the single-label problem is still the object of active investigation [1]. In this chapter, unless otherwise noted, we will always implicitly refer to the binary case.
Aside from actual operational use, which we will not discuss, we can roughly distinguish three different phases in the life cycle of a TC system, which have traditionally been tackled in isolation from each other (i.e. a solution to one problem not being influenced by the solutions given to the other two): document indexing, classifier learning, and classifier evaluation. The following three subsections are devoted to these three phases, respectively; for a more detailed treatment see Sections 5, 6 and 7, respectively, of [2].
2.1 Document indexing
Document indexing denotes the activity of mapping a document $d_j$ into a compact representation of its content that can be directly interpreted (i) by a classifier-building algorithm and (ii) by a classifier, once it has been built. The document indexing methods usually employed in TC are borrowed from IR, where a text $d_j$ is typically represented as a vector of term weights $\vec{d}_j = \langle w_{1j}, \ldots, w_{|T|j} \rangle$. Here, $T$ is the dictionary, i.e. the set of terms (also known as features) that occur at least once in at least $k$ documents (in TC: in at least $k$ training documents), and $0 \le w_{kj} \le 1$ quantifies the importance of $t_k$ in characterizing the semantics of $d_j$. Typical values of $k$ are between 1 and 5.
³ Somewhat confusingly, in the ML field the single-label case is dubbed the multiclass case.
An indexing method is characterized by (i) a definition of what a term is, and (ii) a method to compute term weights. Concerning (i), the most frequent choice is to identify terms either with the words occurring in the document (with the exception of stop words, i.e. topic-neutral words such as articles and prepositions, which are eliminated in a pre-processing phase), or with their stems (i.e. their morphological roots, obtained by applying a stemming algorithm [3]). A popular choice is to add to the set of words or stems a set of phrases, i.e. longer (and semantically more significant) language units extracted from the text by shallow parsing and/or statistical techniques [4]. Concerning (ii), term weights may be binary-valued (i.e. $w_{kj} \in \{0, 1\}$) or real-valued (i.e. $0 \le w_{kj} \le 1$), depending on whether the classifier-building algorithm and the classifiers, once they have been built, require binary input or not. When weights are binary, these simply indicate presence/absence of the term in the document. When weights are non-binary, they are computed by either statistical or probabilistic techniques (see e.g. [5]), the former being the most common option. One popular class of statistical term weighting functions is tf ∗ idf (see e.g. [6]), where two intuitions are at play: (a) the more frequently $t_k$ occurs in $d_j$, the more important for $d_j$ it is (the term frequency intuition); (b) the more documents $t_k$ occurs in, the less discriminating it is, i.e. the smaller its contribution is in characterizing the semantics of a document in which it occurs (the inverse document frequency intuition). Weights computed by tf ∗ idf techniques are often normalized so as to counter the tendency of tf ∗ idf to emphasize long documents.
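The two intuitions and the normalization step can be illustrated with a minimal sketch. It assumes one common variant of tf ∗ idf (raw term frequency times the logarithm of the inverse document frequency, followed by cosine normalization); the chapter does not commit to a specific formula, and the function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one cosine-normalized tf*idf weight vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # (a) term frequency intuition times (b) inverse document frequency intuition.
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Cosine normalization counters the advantage of long documents
        # and keeps every weight in [0, 1].
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return vectors


# Toy corpus (assumed already tokenized and stripped of stop words).
docs = [["spam", "offer", "offer"], ["meeting", "offer"], ["meeting", "agenda"]]
vectors = tfidf_vectors(docs)
```

In the first toy document, "spam" receives a higher weight than "offer" even though "offer" occurs twice, because "offer" also appears in another document and is therefore less discriminating.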
In TC, unlike in IR, a dimensionality reduction phase is often applied so as to reduce the size of the document representations from $|T|$ to a much smaller, predefined number. This has the effect both of reducing overfitting (i.e. the tendency of the classifier to better classify the data it has been trained on than new unseen data), and of making the problem more manageable for the learning method, since many such methods are known not to scale well to high problem sizes. Dimensionality reduction often takes the form of feature selection: each term is scored by means of a scoring function that captures its degree of (positive, and sometimes also negative) correlation with $c_i$, and only the highest scoring terms are used for document representation. Alternatively, dimensionality reduction may take the form of feature extraction: a set of “artificial” terms is generated from the original term set in such a way that the newly generated terms are both fewer and stochastically more independent from each other than the original ones used to be.
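Feature selection can be sketched as follows, assuming the chi-square statistic as the scoring function (one of several possible choices, and one that rewards both positive and negative correlation with the category). The names `chi_square` and `select_terms` and the toy data are illustrative assumptions.

```python
from typing import List, Set


def chi_square(n11: int, n10: int, n01: int, n00: int) -> float:
    """Chi-square statistic of a 2x2 term/category contingency table.

    n11: docs with the term, in c_i       n10: docs with the term, not in c_i
    n01: docs without the term, in c_i    n00: docs without the term, not in c_i
    """
    n = n11 + n10 + n01 + n00
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / den if den else 0.0


def select_terms(docs: List[Set[str]], labels: List[bool], k: int) -> List[str]:
    """Keep the k terms with the highest score with respect to one category."""
    vocab = set().union(*docs)
    n_pos = sum(labels)
    scores = {}
    for t in vocab:
        n11 = sum(1 for d, y in zip(docs, labels) if t in d and y)
        n10 = sum(1 for d, y in zip(docs, labels) if t in d and not y)
        scores[t] = chi_square(n11, n10, n_pos - n11, len(docs) - n_pos - n10)
    # Ties are broken alphabetically so that the result is deterministic.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


# Toy training set for a hypothetical category Sport (True = positive example).
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "team"}, {"vote", "law"}]
labels = [True, True, False, False]
selected = select_terms(docs, labels, k=2)
```

Here "goal" (positively correlated with the category) and "vote" (negatively correlated) are retained, while "team", which is equally frequent inside and outside the category, scores zero.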
2.2 Classifier learning
A text classifier for $c_i$ is automatically generated by a general inductive process (the learner) which, by observing the characteristics of a set of documents preclassified under $c_i$ or $\bar{c}_i$, gleans the characteristics that a new unseen document should have in order to belong to $c_i$. In order to build classifiers for $C$, one thus needs a set $\Omega$ of documents such that the value of $\Phi(d_j, c_i)$ is known for every $\langle d_j, c_i \rangle \in \Omega \times C$. In experimental TC it is customary to partition $\Omega$ into three disjoint sets $Tr$ (the training set), $Va$ (the validation set), and $Te$ (the test set). The training set is the set of documents by observing which the learner builds the classifier. The validation set is the set of documents on which the engineer fine-tunes the classifier, e.g. choosing, for a parameter $p$ on which the classifier depends, the value that has yielded the best effectiveness when evaluated on $Va$. The test set is the set on which the effectiveness of the classifier is finally evaluated. In both the validation and test phase, “evaluating the effectiveness” means running the classifier on a set of preclassified documents ($Va$ or $Te$) and checking the degree of correspondence between the output of the classifier and the preassigned classes.
Different learners have been applied in the TC literature. Some of these methods generate binary-valued classifiers of the required form $\hat{\Phi} : D \times C \to \{T, F\}$, but some others generate real-valued functions of the form $CSV : D \times C \to [0, 1]$ (CSV standing for categorization status value). For these latter, a set of thresholds $\tau_i$ needs to be determined (typically, by experimentation on a validation set) so as to turn real-valued CSVs into the final binary decisions [7].
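Threshold determination on a validation set can be sketched as follows. The sketch assumes that the threshold is chosen so as to maximize $F_1$ (the harmonic mean of precision and recall) on the validation documents, which is only one of several possible criteria; the function names and the toy scores are hypothetical.

```python
from typing import List, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 written in terms of counts; equivalent to 2*pi*rho / (pi + rho)."""
    den = 2 * tp + fp + fn
    return 2 * tp / den if den else 0.0


def tune_threshold(scores: List[float], labels: List[bool]) -> Tuple[float, float]:
    """Return the threshold tau (and its F1) that maximizes F1 on validation data.

    A document is assigned to the category when its CSV is >= tau.
    """
    best_tau, best_f1 = 1.0, -1.0
    for tau in sorted(set(scores)):  # only observed CSV values can change the decisions
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= tau and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < tau and y)
        value = f1(tp, fp, fn)
        if value > best_f1:
            best_tau, best_f1 = tau, value
    return best_tau, best_f1


# Hypothetical CSVs produced for five validation documents of one category.
val_scores = [0.9, 0.8, 0.6, 0.4, 0.2]
val_labels = [True, True, False, True, False]
tau, tau_f1 = tune_threshold(val_scores, val_labels)
```

On this toy data the selected threshold is 0.4, which accepts all three positive documents at the price of one false positive.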
It is worth noting that in several applications the fact that a method implements a real-valued function can be profitably used, in which case determining thresholds is not needed. For instance, in applications in which the quality of the classification is of critical importance (e.g. in filing patents into patent directories), post-editing of the classifier output by a human professional is often necessary. In this case, having the documents ranked in terms of their estimated relevance to the category may be useful, since the human editor can scan the ranked list starting from the documents deemed most appropriate for the category, and stop when desired.
2.3 Classifier evaluation
Training efficiency (i.e. average time required to build a classifier $\hat{\Phi}_i$ from a given corpus $\Omega$), as well as classification efficiency (i.e. average time required to classify a document by means of $\hat{\Phi}_i$), and effectiveness (i.e. average correctness of $\hat{\Phi}_i$’s classification behaviour) are all legitimate measures of success.
In TC research, effectiveness is usually considered the most important criterion, since it is the most reliable one when it comes to experimentally comparing different learners or different TC methodologies, given that efficiency depends on too volatile parameters (e.g. different software and hardware platforms). In TC applications, however, all three parameters are important, and one must carefully look for a tradeoff among them, depending on the application constraints. For instance, in applications involving interaction with the user, a classifier with low classification efficiency is unsuitable. Conversely, in multi-label TC applications involving thousands of categories, a classifier with low training efficiency might also be inappropriate (since many thousands of classifiers need to be learnt). Anyway, effectiveness tends to be the primary criterion in operational contexts too, since in most applications an ineffective although efficient classifier will be hardly useful, or will involve too much post-editing work on the part of human professionals, which might defeat the purpose of using an automated system.
In single-label TC, effectiveness is usually measured by accuracy, i.e. the percentage of correct classification decisions (error is the complement of accuracy, i.e. $E = 1 - A$). However, in binary (and hence in multi-label) TC, accuracy is not an adequate measure. The reason for this is that in binary TC applications the two categories $c_i$ and $\bar{c}_i$ are usually unbalanced, i.e. one contains far more members than the other⁴. In this case, building a classifier that has high accuracy is trivial, since the trivial rejector, i.e. the classifier that trivially assigns all documents to the most heavily populated category (i.e. $\bar{c}_i$), has indeed very high accuracy; and there are no applications in which one is interested in such a classifier⁵.
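A small numerical illustration of this point follows; the class proportions (10 positive documents out of 1000) are an assumption chosen for the example.

```python
from typing import List


def accuracy(predicted: List[bool], actual: List[bool]) -> float:
    """Fraction of classification decisions that agree with the expert."""
    return sum(1 for p, a in zip(predicted, actual) if p == a) / len(actual)


def recall(predicted: List[bool], actual: List[bool]) -> float:
    """Fraction of the true members of the category that the classifier finds."""
    positives = sum(actual)
    found = sum(1 for p, a in zip(predicted, actual) if p and a)
    return found / positives if positives else 0.0


# Unbalanced binary problem: 10 members of c_i, 990 members of its complement.
actual = [True] * 10 + [False] * 990
rejector = [False] * len(actual)  # the trivial rejector assigns nothing to c_i

rejector_accuracy = accuracy(rejector, actual)
rejector_recall = recall(rejector, actual)
```

The trivial rejector is right on 99% of the decisions while retrieving none of the documents that belong to the category, which is why accuracy is uninformative here.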
As a result, in binary TC effectiveness with respect to category $c_i$ is often measured by a combination of precision with respect to $c_i$ ($\pi_i$), the percentage of documents deemed to belong to $c_i$ that in fact belong to it, and recall with respect to $c_i$ ($\rho_i$), the percentage of documents belonging to $c_i$ that are in fact deemed to belong to it.
In multi-label TC, when effectiveness is computed for several categories the precision and recall results for individual categories must be averaged in some way; here, one may opt for microaveraging (“categories count proportionally to the number of their positive training examples”) or for macroaveraging (“all categories count the same”), depending on the application desiderata (see Table 1). The former rewards classifiers that behave well on heavily populated (“frequent”) categories, while classifiers that perform well also on infrequent categories are emphasized by the latter. It is often the case that in TC research macroaveraging is the method of choice, since producing classifiers that perform well also on infrequent categories is the most challenging problem of TC.
⁴ For example, the number of Web pages that should be filed under the category NuclearWasteDisposal is orders of magnitude smaller than the number of pages that should not.
⁵ One further consequence of adopting accuracy as the effectiveness measure when classes are unbalanced is that in the phase of parameter tuning on a validation set (see Section 2.2), there will be a tendency to choose parameter values that make the classifier behave very much like the trivial rejector.

                    Microaveraging                                                                          Macroaveraging
Precision ($\pi$)   $\pi = \frac{\sum_{i=1}^{|C|} |TP_i|}{\sum_{i=1}^{|C|} (|TP_i| + |FP_i|)}$     $\pi = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{|TP_i|}{|TP_i| + |FP_i|}$
Recall ($\rho$)     $\rho = \frac{\sum_{i=1}^{|C|} |TP_i|}{\sum_{i=1}^{|C|} (|TP_i| + |FN_i|)}$    $\rho = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{|TP_i|}{|TP_i| + |FN_i|}$

Table 1: Averaging precision and recall across different categories; $TP_i$, $TN_i$, $FP_i$ and $FN_i$ refer to the sets of true positives, true negatives, false positives, and false negatives with respect to $c_i$, respectively.
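The difference between the two averaging methods can be made concrete with a short sketch. The per-category counts are invented for illustration, and the convention of scoring an undefined ratio (zero denominator) as 0 is an assumption, not something the chapter prescribes.

```python
from typing import Dict, List, Tuple

Counts = Tuple[int, int, int]  # (|TP_i|, |FP_i|, |FN_i|) for one category c_i


def _ratio(num: int, den: int) -> float:
    # Assumed convention: an undefined ratio counts as 0.
    return num / den if den else 0.0


def micro_macro(counts: List[Counts]) -> Dict[str, float]:
    """Micro- and macroaveraged precision and recall over all categories."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    n = len(counts)
    return {
        # Microaveraging: pool the decisions of all categories, then compute.
        "micro_precision": _ratio(tp, tp + fp),
        "micro_recall": _ratio(tp, tp + fn),
        # Macroaveraging: compute per category, then take the plain mean.
        "macro_precision": sum(_ratio(t, t + f) for t, f, _ in counts) / n,
        "macro_recall": sum(_ratio(t, t + m) for t, _, m in counts) / n,
    }


# One frequent category classified well, one infrequent category classified poorly.
counts = [(90, 10, 10), (1, 1, 9)]
averages = micro_macro(counts)
```

With these counts the microaveraged recall is 91/110 (about 0.83), dominated by the frequent category, while the macroaveraged recall drops to 0.5 because the infrequent category weighs as much as the frequent one.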
Since most classifiers can be arbitrarily tuned to emphasize recall at the expense of precision (and vice versa), only combinations of the two are significant. The most popular way to combine the two is the function $F_\beta = \frac{(\beta^2 + 1)\pi\rho}{\beta^2\pi + \rho}$, for some value $0 \le \beta \le +\infty$; usually, $\beta$ is taken to be equal to 1, which means that the $F_\beta$ function becomes $F_1 = \frac{2\pi\rho}{\pi + \rho}$, i.e. the harmonic mean of precision and recall. Note that for the trivial rejector $\rho = 0$ (with $\pi$ conventionally set to 1, since no document is assigned to $c_i$), so $F_\beta = 0$ for any finite $\beta > 0$; symmetrically, for the trivial acceptor $\rho = 1$ but $\pi$ equals the typically tiny fraction of documents that belong to $c_i$, so $F_1$ is close to 0.
Finally, it should be noted that some applications of TC require cost-based issues to be brought to bear on how effectiveness is computed, thus inducing a utility-theoretic notion of effectiveness. For instance, in spam filtering (i.e. a binary TC task in which e-mail messages must be classified in the category Spam or its complement NonSpam), precision is more important than recall, since filing a legitimate message under Spam is a more serious error (i.e. it bears more cost) than filing a junk message under NonSpam. One possible way of taking this into account is using the $F_\beta$ measure with $\beta \ne 1$; using values of $0 \le \beta < 1$ corresponds to paying more attention to precision than to recall, while by using values of $1 < \beta < +\infty$ one emphasizes recall at the expense of precision.
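The effect of $\beta$ can be checked with a direct transcription of the standard formula $F_\beta = (\beta^2 + 1)\pi\rho / (\beta^2\pi + \rho)$. The function name `f_beta`, the handling of the limiting cases, and the example values of precision and recall are assumptions made for this sketch.

```python
import math


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F_beta = (beta^2 + 1) * pi * rho / (beta^2 * pi + rho)."""
    if math.isinf(beta):
        return recall  # limit for beta -> +infinity: only recall matters
    den = beta ** 2 * precision + recall
    # Assumed convention: an undefined value (zero denominator) counts as 0.
    return (beta ** 2 + 1) * precision * recall / den if den else 0.0


# A hypothetical spam filter with high precision and modest recall.
pi, rho = 0.9, 0.5
precision_oriented = f_beta(pi, rho, beta=0.5)
balanced = f_beta(pi, rho, beta=1.0)
recall_oriented = f_beta(pi, rho, beta=2.0)
```

Because this classifier is stronger on precision than on recall, its score decreases as $\beta$ grows: the precision-oriented value is the highest and the recall-oriented value the lowest.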
3 Techniques
We now discuss some of the actual techniques for dealing with the problems of document indexing and classifier learning, discussed in the previous section. Presenting a complete review of them is outside the scope of this chapter; as a consequence, we will only hint at the various choices that are available to the designer, and will enter into some detail only for a few of them.
3.1 Document indexing techniques
The TC community has not displayed much creativity in devising document weighting techniques specific to TC. In fact, most of the works reported in the TC literature so far use the standard document weighting techniques, either of a statistical or of a probabilistic nature, which are used in all other subfields of IR, including text search (e.g. tf ∗ idf or BM25; see [5]). The only exception to this that we know of is [8], where the idf component in tf ∗ idf is replaced by a function learnt from training data, and aimed at assessing how good a term is at discriminating categories from each other.
Also in TC, as in other subfields of IR, the use of larger indexing units, such as frequently adjacent pairs (aka “bigrams”) or syntactically determined phrases, has not shown systematic patterns of improvement [4, 9], which means that terms are usually made to coincide with single words, stemmed or not.
Dimensionality reduction is tackled either by feature selection techniques, such as mutual information (aka information gain) [10], chi square [11], or gain ratio [8], or by feature extraction techniques, such as latent semantic indexing [12, 13] or term clustering [9]. Recent work on term extraction methods has focused on methods specific to TC (or rather: specific to problems in which training data exist), i.e. on supervised term clustering techniques [14, 15, 16], which have shown better performance than the previously mentioned unsupervised techniques.
3.2 Classifier learning techniques
The number of classes of classifier learning techniques that have been used in TC is bewildering. These include at the very least probabilistic methods, regression methods, decision tree and decision rule learners, neural networks, batch and incremental learners of linear classifiers, example-based methods, support vector machines, genetic algorithms, hidden Markov models, and classifier committees (which include boosting methods). Rather than attempting to say even a few words about each of them, we will introduce in some detail two of them, namely support vector machines and boosting. The reasons for this choice are twofold. First, these are the two methods that have unquestionably shown the best performance in comparative TC experiments performed so far. Second, these are the newest methods in the classifier learning arena, and the ones with the strongest justifications from computational learning theory.
3.2.1 Support vector machines
The support vector machine (SVM) method was introduced in TC by Joachims [17, 18] and subsequently used in several other TC works [19, 20, 21]. In geometrical terms, it may be seen as the attempt to find, among all the surfaces $\sigma_1, \sigma_2, \ldots$ in $|T|$-dimensional space that separate the positive from the negative training examples (decision surfaces), the $\sigma_i$ that separates the positives from the negatives by the widest possible margin, i.e. such that the minimal distance between the hyperplane and a training example is maximum; results in computational learning theory indicate that this tends to minimize the generalization error, i.e. the error of the resulting classifier on yet unseen examples. SVMs were originally conceived for binary classification problems [22], and only recently have they been adapted to multiclass classification [1].
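For linearly separable training data, the widest-margin criterion corresponds to the standard hard-margin formulation sketched below. The symbols $\vec{w}$ and $b$ (the parameters of the separating hyperplane) and the encoding $y_j \in \{-1, +1\}$ of the value of $\Phi(d_j, c_i)$ are notation introduced here for illustration; they are not the chapter's own.

```latex
\min_{\vec{w},\, b} \ \frac{1}{2} \lVert \vec{w} \rVert^{2}
\qquad \text{subject to} \qquad
y_j \, (\vec{w} \cdot \vec{d}_j + b) \ge 1
\quad \text{for all } d_j \in Tr
```

Under this scaling, the distance between the hyperplane and the closest training example is $1 / \lVert \vec{w} \rVert$, so minimizing $\lVert \vec{w} \rVert$ is the same as maximizing the margin.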
As argued by Joachims [17], one advantage that SVMs offer for TC is that dimensionality reduction is usually not needed, as SVMs tend to be fairly robust to overfitting and can scale up to considerable dimensionalities. Recent extensive experiments by Brank and colleagues [23] also indicate that feature selection tends to be detrimental to the performance of SVMs.
Recently, efficient algorithms for SVM learning have also been discovered; as a consequence, the use of SVMs for high-dimensional problems such as TC is no longer prohibitive from the point of view of computational cost.
There are currently several freely available packages for SVM learning. The best known in the binary TC camp is the SVMlight package⁶, while one that has been extended to also deal with the general single-label classification problem is BSVM⁷.
3.2.2 Boosting
Classifier committees (aka ensembles) are based on the idea that $k$ different classifiers $\Phi_1, \ldots, \Phi_k$ may be better than one if their individual judgments are appropriately combined. In the boosting method [24, 25, 26, 27] the $k$ classifiers $\Phi_1, \ldots, \Phi_k$ are obtained by the same learning method (here called the weak learner), and are trained not in a conceptually parallel and independent way, but sequentially. In this way, in training classifier $\Phi_t$ one may take into account how classifiers $\Phi_1, \ldots, \Phi_{t-1}$ perform on the training examples, and concentrate on getting right those examples on which $\Phi_1, \ldots, \Phi_{t-1}$ have performed worst.
Specifically, for learning classifier $\Phi_t$ each $\langle d_j, c_i \rangle$ pair is given an “importance weight” $h^t_{ij}$ (where $h^1_{ij}$ is set to be equal for all $\langle d_j, c_i \rangle$ pairs), which represents how hard it was for classifiers $\Phi_1, \ldots, \Phi_{t-1}$ to get a correct decision for this pair. These weights are exploited in learning $\Phi_t$, which will be specially tuned to correctly solve the pairs with higher weight. Classifier $\Phi_t$ is then applied to the training documents, and as a result weights $h^t_{ij}$ are updated to $h^{t+1}_{ij}$; in this update operation, pairs correctly classified by $\Phi_t$ will have their weight decreased, while pairs misclassified by $\Phi_t$ will have their weight increased. After all the $k$ classifiers have been built, a weighted linear combination rule is applied to yield the final committee.
⁶ SVMlight is available from http://svmlight.joachims.org/
⁷ BSVM is available from http://www.csie.ntu.edu.tw/~cjlin/bsvm/
Boosting has proven a powerful intuition, and the BoosTexter system⁸ has reached one of the highest levels of effectiveness reported in the literature so far.
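The sequential reweighting scheme can be illustrated with a compact sketch of binary AdaBoost, with one-term decision stumps as the weak learner. This is an assumption-laden illustration and not BoosTexter's algorithm: it handles a single category (so weights are attached to documents rather than to document/category pairs), labels are encoded as +1 and -1, and all names and the toy data are hypothetical.

```python
import math
from typing import List, Set, Tuple

Stump = Tuple[str, int]  # (term, polarity): predict polarity if the term occurs, else -polarity


def stump_predict(stump: Stump, doc: Set[str]) -> int:
    term, polarity = stump
    return polarity if term in doc else -polarity


def train_adaboost(docs: List[Set[str]], labels: List[int], rounds: int) -> List[Tuple[float, Stump]]:
    n = len(docs)
    weights = [1.0 / n] * n  # initial importance weights are all equal
    vocab = sorted(set().union(*docs))
    committee: List[Tuple[float, Stump]] = []
    for _ in range(rounds):
        # Weak learner: the stump with the lowest weighted error on the training set.
        best, best_err = None, float("inf")
        for term in vocab:
            for polarity in (1, -1):
                err = sum(w for w, d, y in zip(weights, docs, labels)
                          if stump_predict((term, polarity), d) != y)
                if err < best_err:
                    best, best_err = (term, polarity), err
        if best is None or best_err >= 0.5:
            break  # no stump is better than chance on the current weights
        eps = max(best_err, 1e-10)  # guard against log of infinity for a perfect stump
        alpha = 0.5 * math.log((1 - eps) / eps)
        committee.append((alpha, best))
        if best_err == 0:
            break  # a perfect stump leaves nothing to concentrate on
        # Decrease the weight of correctly classified examples, increase the others.
        weights = [w * math.exp(-alpha * y * stump_predict(best, d))
                   for w, d, y in zip(weights, docs, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return committee


def classify(committee: List[Tuple[float, Stump]], doc: Set[str]) -> int:
    """Weighted linear combination of the committee members' votes."""
    score = sum(alpha * stump_predict(stump, doc) for alpha, stump in committee)
    return 1 if score >= 0 else -1


# Toy data: a document is a positive example iff it contains at least two of the three terms.
docs = [set(), {"goal"}, {"match"}, {"team"}, {"goal", "match"},
        {"goal", "team"}, {"match", "team"}, {"goal", "match", "team"}]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]
committee = train_adaboost(docs, labels, rounds=3)
```

No single stump classifies this toy training set correctly, but after three rounds the weighted committee does, because each new stump is chosen with respect to the examples that its predecessors got wrong.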
4 Applications
As mentioned in Section 1, the applications of TC are manifold. Common traits among all of them are:
• The need to handle and organize documents in which the textual component is either the unique, or dominant, or simplest to interpret, component.
• The need to handle and organize large quantities of such documents, i.e. large enough that their manual organization into classes is either too expensive or not feasible within the time constraints imposed by the application.
• The fact that the set of categories is known in advance, and its variation over time is small⁹.
Applications may instead vary along several dimensions:
• The nature of the documents; i.e. documents may be structured texts (such as e.g. scientific articles), newswire stories, classified ads, image captions, e-mail messages, transcripts of spoken texts, hypertexts, or other. If the documents are hypertextual, rather than textual, very different techniques may be used, since links provide a rich source of information on which classifier learning activity can leverage. Techniques exploiting this intuition in a TC context have been presented in [28, 29, 30, 31] and experimentally compared in [32].
• The structure of the classification scheme, i.e. whether this is flat or hierarchical. Hierarchical classification schemes may in turn be tree-shaped, or allow for multiple inheritance (i.e. be DAG-shaped). Again, the hierarchical structure of the classification scheme may allow radically more efficient, and also more effective, classification algorithms, which can take advantage of early subtree pruning [33, 21, 34], improved selection of negative examples [35], or improved estimation of word occurrence statistics in leaf nodes [36, 37, 38, 39].
• The nature of the task, i.e. whether the task is single-label or multi-label.
Hereafter, we briefly review some important applications of TC. Note that the borders between the different classes of applications listed here are fuzzy, and some of these may be considered special cases of others.
⁸ BoosTexter is available from http://www.cs.princeton.edu/~schapire/boostexter.html
⁹ In practical applications, the set of categories does change from time to time. For instance, in indexing computer science scientific articles under the ACM classification scheme, one needs to consider that this scheme is revised every five to ten years, to reflect changes in the CS discipline. This means that training documents need to be created for newly introduced categories, and that training documents may have to be removed for categories whose meaning has evolved.
4.1 Automatic indexing for Boolean information retrieval systems
The application that has stimulated the research in TC from its very beginning, back in the ’60s, up to the ’80s, is that of automatic indexing of scientific articles by means of a controlled dictionary, such as the ACM Classification Scheme, where the categories are the entries of the controlled dictionary. This is typically a multi-label task, since several index terms are usually assigned to each document.
Automatic indexing with controlled dictionaries is closely related to the automated metadata generation task. In digital libraries one is usually interested in tagging documents by metadata that describe them under a variety of aspects (e.g. creation date, document type or format, availability, etc.). Some of these metadata are thematic, i.e. their role is to describe the semantics of the document by means of bibliographic codes, keywords or keyphrases. The generation of these metadata may thus be viewed as a problem of document indexing with a controlled dictionary, and thus tackled by means of TC techniques. In the case of Web documents, metadata describing them will be needed for the Semantic Web to become a reality, and TC techniques applied to Web data may be envisaged as contributing part of the solution to the huge problem of generating the metadata needed by Semantic Web resources.
4.2 Document organization
Indexing with a controlled vocabulary is an instance of the general problem of document base organization. In general, many other issues pertaining to document organization and filing, be it for purposes of personal organization or structuring of a corporate document base, may be addressed by TC techniques. For instance, at the offices of a newspaper, it might be necessary to classify all past articles in order to ease future retrieval in the case of new events related to the ones described by the past articles. Possible categories might be HomeNews, International, Money, Lifestyles, Fashion, but also finer-grained ones such as ThePittAnistonMarriage.
Another possible application in the same range is the organization of patents into categories for making later access easier, and of patent applications for allowing patent officers to discover possible prior work on the same topic [40]. This application, as all applications having to do with patent data, introduces specific problems, since the description of the allegedly novel technique, which is written by the patent applicant, may intentionally use non-standard vocabulary in order to create the impression that the technique is indeed novel. This use of non-standard vocabulary may depress the performance of a text classifier, since the assumption that underlies practically all TC work is that training documents and test documents are drawn from the same word distribution.
4.3 Text filtering
Text filtering is the activity of classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer. Typical cases of filtering systems are e-mail filters [41] (in which case the producer is actually a multiplicity of producers), newsfeed filters [42], or filters of unsuitable content [43]. A filtering system should block the delivery of the documents the consumer is likely not interested in. Filtering is a case of binary TC, since it involves the classification of incoming documents into two disjoint categories, the relevant and the irrelevant. Additionally, a filtering system may also further classify the documents deemed relevant to the consumer into thematic categories of interest to the user.
A filtering system may be installed at the producer end, in which case it must route the documents to the interested consumers only, or at the consumer end, in which case it must block the delivery of documents deemed uninteresting to the consumer.
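The two placements just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `route` and `deliver`, the consumer names, and the keyword-matching lambdas are all hypothetical stand-ins for trained binary classifiers, not part of any system cited here.

```python
def route(document, consumer_classifiers):
    """Producer-end filtering: return the consumers whose binary
    classifier judges the document relevant."""
    return [name
            for name, is_relevant in consumer_classifiers.items()
            if is_relevant(document)]


def deliver(document, is_relevant):
    """Consumer-end filtering: pass the document through if relevant,
    otherwise block it (signalled here by returning None)."""
    return document if is_relevant(document) else None


# Hypothetical stand-ins for trained binary classifiers (assumption).
classifiers = {
    "alice": lambda doc: "football" in doc.lower(),
    "bob": lambda doc: "election" in doc.lower(),
}

recipients = route("Election results announced", classifiers)
blocked = deliver("Election results announced", classifiers["alice"])
```

In both cases the underlying decision is the same binary one (relevant versus irrelevant); only the place where it is taken, and the action that follows, differ.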
In information science document filtering has a tradition dating back to the ’60s, when, addressed by systems of various degrees of automation and dealing with the multi-consumer case discussed above, it was called selective dissemination of information or current awareness. The explosion in the availability of digital information has boosted the importance of such systems, which are nowadays being used in diverse contexts such as the creation of personalized Web newspapers, junk e-mail blocking, and Usenet news selection.
4.4 Hierarchical categorization of Web pages
TC has recently aroused a lot of interest also for its possible application to automatically classifying Web pages, or sites, under the hierarchical catalogues hosted by popular Internet portals. When Web documents are catalogued in this way, rather than issuing a query to a general-purpose Web search engine a searcher may find it easier to first navigate in the hierarchy of categories and then restrict her search to a particular category of interest. Classifying Web pages automatically has obvious advantages, since the manual categorization of a large enough subset of the Web is infeasible. With respect to previously discussed TC applications, automatic Web page categorization has two essential peculiarities (both discussed in Section 4), namely the hypertextual nature of the documents, and the typically hierarchical structure of the category set.
4.5 Word sense disambiguation
Word sense disambiguation (WSD) is the activity of finding, given the occurrence in a text of an ambiguous (i.e. polysemous or homonymous) word, the sense of this particular word occurrence. For instance, bank may have (at least) two different senses in English, as in the Bank of England (a financial institution) or the bank of river Thames (a hydraulic engineering artifact). It is thus a WSD task to decide which of the above senses the occurrence of bank in "Last week I borrowed some money from the bank" has. WSD may be seen as a (single-label) TC task (see e.g. [44]) once, given a word w, we view the contexts of occurrence of w as documents and the senses of w as categories.
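To make the reduction concrete, the following minimal sketch turns each occurrence of a word w into a small pseudo-document made of its neighbouring words; a single-label classifier would then assign one sense to each such pseudo-document. The function name `occurrence_contexts`, the window width, and the whitespace tokenization are illustrative assumptions, not the method of [44].

```python
def occurrence_contexts(text, target, width=3):
    """Return one pseudo-document (a list of context words) for every
    occurrence of `target` in `text`.

    The context is the `width` tokens on each side of the occurrence;
    the window size is an arbitrary choice made for this sketch.
    """
    punctuation = ".,;:!?()\"'"
    tokens = [tok.strip(punctuation) for tok in text.lower().split()]
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            contexts.append(left + right)
    return contexts


sentence = "Last week I borrowed some money from the bank"
docs = occurrence_contexts(sentence, "bank", width=3)
```

Here the single occurrence of bank yields one pseudo-document whose words (money, from, the) are the evidence from which a classifier would have to choose between the candidate senses.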
4.6 Automated survey coding
Survey coding is the task of assigning a symbolic code from a predefined set of such codes to the answer that a person has given in response to an open-ended question in a questionnaire (aka survey). This task is usually carried out in order to group respondents according to a predefined scheme based on their answers. Survey coding has several applications, especially in the social sciences, where the classification of respondents is functional to the extraction of statistics on political opinions, health and lifestyle habits, customer satisfaction, brand fidelity, and patient satisfaction.
Survey coding is a difficult task, since the code that should be attributed to a respondent based on the answer she has given is a matter of subjective judgment, and thus requires expertise. The problem can be formulated as a single-label TC problem [45], where the answers play the role of the documents, and the codes that are applicable to the answers returned to a given question play the role of the categories (different questions thus correspond to different TC problems).
4.7 Automated authorship attribution and genre classification
Authorship attribution is the task of determining the author of a text of disputed or unknown paternity, choosing from a predefined set of candidate authors [46, 47, 48]. Authorship attribution has several applications, ranging from the literary (e.g. discovering who the author of a recently discovered sonnet is) to the forensic (e.g. identifying the sender of an anonymous letter, or checking the authenticity of a letter allegedly authored by a given person).
Authorship attribution can also be seen as a single-label TC task, with possible authors playing the role of the categories. This is an application in which a TC system typically cannot be taken at face value; usually, its result contributes an “opinion” on who the possible author might be, but the final decision has to be taken by a human professional. As a result, a TC system that ranks the candidate authors in terms of their probability of being the true author would be useful (see Section 2.2).
The intuitions that must be brought to bear in these applications are orthogonal to those that are at play in topic-based classification, since an author normally writes about multiple topics. Because of this, it is unlikely that topic-based features can be good at discriminating among authors. Rather, stylistic features are the most appropriate choice; for instance, vocabulary richness (i.e. the ratio between the number of distinct words and the total number of words), average word length, and average sentence length are important, in the sense that it is these features that tend “to give an author away”.
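The three stylistic features named above are simple to compute. The sketch below is a minimal illustration, assuming a crude regular-expression tokenizer and treating any run of ".", "!" or "?" as a sentence boundary; the function name `stylistic_features` and these tokenization choices are assumptions made here, and a real system would likely use a proper tokenizer and sentence splitter.

```python
import re


def stylistic_features(text):
    """Compute three topic-neutral features of a text:
    vocabulary richness, average word length, average sentence length."""
    # Crude tokenization (assumption): words are runs of letters/apostrophes.
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Crude sentence splitting (assumption): split on ., ! and ?.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return {"vocabulary_richness": 0.0,
                "avg_word_length": 0.0,
                "avg_sentence_length": 0.0}
    return {
        # distinct words / total words
        "vocabulary_richness": len(set(words)) / len(words),
        # mean number of characters per word
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # mean number of words per sentence
        "avg_sentence_length": len(words) / len(sentences),
    }


features = stylistic_features("The cat sat. The cat ran away!")
```

For this toy input there are 7 words, 5 of them distinct, in 2 sentences, so the vocabulary richness is 5/7 and the average sentence length is 3.5 words. None of the three values depends on what the text is about, which is the property the paragraph above asks for.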
Genre classification is also an applicative context which bears remarkable similarities to authorship attribution. There are applicative contexts in which it is desirable to classify documents by genre, rather than by topic [49, 50, 51, 52]. For instance, it might be desirable to classify articles about scientific subjects into one of the two categories PopularScience and HardScience, in order to decide whether they are suitable for publication in popular science magazines or not; likewise, distinguishing between ProductReviews and Advertisements might be useful for several applications. In genre classification too, topic-dependent words are not good separating features, and specialized features need to be devised, which are often similar to the ones used for authorship attribution applications.
4.8 Spam filtering
Filtering spam (i.e. unsolicited bulk e-mail) is a task of increased applicative interest that lies at the crossroads between filtering and genre classification. In fact, it has the dynamical character of other filtering applications, such as e-mail filtering, and it cuts across different topics, as genre classification does. Several attempts, some of them quite successful, have been made at applying standard text classification techniques to spam filtering, for applications involving either personal mail [53, 19, 54] or mailing lists [55]. However, operational spam filters must rely not only on standard ML techniques, but also on manually selected features. In fact, similarly to the case of genre classification or authorship attribution, it is the stylistic (i.e. topic-neutral) features that are important, rather than the topic-based ones. Spam deals with a multiplicity of topics (from miraculous money making schemes to Viagra pills), and cues indicative of topics can hardly be effective unless they are supplemented with other topic-neutral ones. By contrast, a human eye may immediately recognize a spam message from visual cues, such as the amount of all-caps words in the subject line or in the text of the message, the number of exclamation marks in the subject line, an unknown sender with an unknown Web e-mail address (e.g. [email protected]), or even the peculiar formatting of the message body. Representing these visual cues (as well as taking into account other standard phrases such as “Make money fast!”) as features is important to the effectiveness of an operational spam filter.
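As an illustration of how such visual cues might be turned into features, here is a minimal sketch. The function name `spam_cue_features`, the feature names, the example addresses, and the single stock phrase checked are hypothetical choices made for this example; an operational filter would use a much richer, manually curated set.

```python
def is_all_caps(word):
    """True if the word has at least two letters and all are upper case."""
    letters = [c for c in word if c.isalpha()]
    return len(letters) >= 2 and all(c.isupper() for c in letters)


def spam_cue_features(subject, body, sender, known_senders):
    """Map a message to a few topic-neutral cue features.

    `known_senders` is a set of lower-cased addresses the user has
    already corresponded with (an assumption of this sketch).
    """
    return {
        "caps_words_subject": sum(is_all_caps(w) for w in subject.split()),
        "caps_words_body": sum(is_all_caps(w) for w in body.split()),
        "exclamations_subject": subject.count("!"),
        "unknown_sender": sender.lower() not in known_senders,
        # One illustrative stock phrase; a real list would be longer.
        "has_stock_phrase": "make money fast" in (subject + " " + body).lower(),
    }


cues = spam_cue_features(
    subject="MAKE MONEY FAST!!!",
    body="Click here NOW",
    sender="stranger@example.com",
    known_senders={"colleague@example.org"},
)
```

The resulting dictionary can be appended to whatever word-based representation the classifier already uses, so that the learner sees both kinds of evidence.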
One further problem that makes spam filtering challenging is the frequent unavailability of negative training messages. A software maker wishing to customize its spam filter for a particular client needs training examples; while positive ones (i.e. spam messages) are not hard to collect in large quantities, negative ones (i.e. legitimate messages) are difficult to find because of privacy issues: a company dealing with industrially sensitive data will not disclose samples of its own incoming legitimate messages even to someone who is going to use these messages for improving a service to it. In this case, ML methods that can do without negative examples need to be used.
4.9 Other applications
The applications described above are just the major ones among those TC has been used for. Here, we only briefly hint at a few others.
Myers and colleagues [56], and Schapire and Singer [25] have attacked speech categorization by means of a combination of speech recognition and TC, in the context of a phone call routing application. Sable and Hatzivassiloglou classify instead images through the classification of their textual captions [57]. Larkey [58] instead uses TC to tackle automated essay grading, where the different grades that can be attributed to an essay play the role of categories. In a question answering application, Li and Roth [59] classify questions according to question type; this allows a question answering system to focus on locating the right type of information for the right type of question, thus improving the effectiveness of the overall system.
5 Conclusion
Text categorization has evolved, from the neglected research niche it used to be until the late ’80s, into a fully blossomed research field which has delivered efficient, effective, and overall workable solutions that have been used in tackling a wide variety of real-world application domains. Key to this success have been (i) the ever-increasing involvement of the machine learning community in text categorization, which has lately resulted in the use of the very latest machine learning technology within text categorization applications, and (ii) the availability of standard benchmarks (such as Reuters-21578 and OHSUMED), which has encouraged research by providing a setting in which different research efforts could be compared to each other, and in which the best methods and algorithms could stand out.
Currently, text categorization research is pointing in several interesting directions. One of them is the attempt at finding better representations for text; while the bag of words model is still the unsurpassed text representation model, researchers have not renounced the belief that a text must be something more than a mere collection of tokens, and that the quest for models more sophisticated than the bag of words model is still worth pursuing [60].
A further direction is investigating the scalability properties of text classification systems, i.e. understanding whether the systems that have proved the best in terms of effectiveness alone stand up to the challenge of dealing with very large numbers of categories (e.g. in the tens of thousands) [61].
Last but not least are the attempts at solving the labeling bottleneck, i.e. at coming to terms with the fact that labeling examples for training a text classifier, when labeled examples do not previously exist, is expensive. As a result, there is increasing attention in text categorization to semi-supervised machine learning methods, i.e. methods that can bootstrap off a small set of labeled examples and leverage unlabeled examples too [62].
References
[1] Crammer, K. & Singer, Y., On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, pp. 265–292, 2001.
[2] Sebastiani, F., Machine learning in automated text categorization. ACM Computing Surveys, 34(1), pp. 1–47, 2002.
[3] Frakes, W.B., Stemming algorithms. Information Retrieval: Data Structures and Algorithms, eds. W.B. Frakes & R. Baeza-Yates, Prentice Hall: Englewood Cliffs, US, pp. 131–160, 1992.
[4] Caropreso, M.F., Matwin, S. & Sebastiani, F., A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. Text Databases and Document Management: Theory and Practice, ed. A.G. Chin, Idea Group Publishing: Hershey, US, pp. 78–102, 2001.
[5] Zobel, J. & Moffat, A., Exploring the similarity space. SIGIR Forum, 32(1), pp. 18–34, 1998.
[6] Salton, G. & Buckley, C., Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), pp. 513–523, 1988. Also reprinted in [63], pp. 323–328.
[7] Yang, Y., A study on thresholding strategies for text categorization. Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, eds. W.B. Croft, D.J. Harper, D.H. Kraft & J. Zobel, ACM Press, New York, US: New Orleans, US, pp. 137–145, 2001.
[8] Debole, F. & Sebastiani, F., Supervised term weighting for automated text categorization. Proceedings of SAC-03, 18th ACM Symposium on Applied Computing, ACM Press, New York, US: Melbourne, US, pp. 784–788, 2003.
[9] Lewis, D.D., An evaluation of phrasal and clustered representations on a text categorization task. Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval, eds. N.J. Belkin, P. Ingwersen & A.M. Pejtersen, ACM Press, New York, US: Kobenhavn, DK, pp. 37–50, 1992.
[10] Lewis, D.D. & Ringuette, M., A comparison of two learning algorithms for text categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, US, pp. 81–93, 1994.
[11] Yang, Y. & Pedersen, J.O., A comparative study on feature selection in text categorization. Proceedings of ICML-97, 14th International Conference on Machine Learning, ed. D.H. Fisher, Morgan Kaufmann Publishers, San Francisco, US: Nashville, US, pp. 412–420, 1997.
[12] Wiener, E.D., Pedersen, J.O. & Weigend, A.S., A neural network approach to topic spotting. Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, US, pp. 317–332, 1995.
[13] Schütze, H., Hull, D.A. & Pedersen, J.O., A comparison of classifiers and document representations for the routing problem. Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval, eds. E.A. Fox, P. Ingwersen & R. Fidel, ACM Press, New York, US: Seattle, US, pp. 229–237, 1995.
[14] Baker, L.D. & McCallum, A.K., Distributional clustering of words for text classification. Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, eds. W.B. Croft, A. Moffat, C.J.V. Rijsbergen, R. Wilkinson & J. Zobel, ACM Press, New York, US: Melbourne, AU, pp. 96–103, 1998.
[15] Bekkerman, R., El-Yaniv, R., Tishby, N. & Winter, Y., On feature distributional clustering for text categorization. Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, eds. W.B. Croft, D.J. Harper, D.H. Kraft & J. Zobel, ACM Press, New York, US: New Orleans, US, pp. 146–153, 2001.
[16] Slonim, N. & Tishby, N., The power of word clusters for text classification. Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research, Darmstadt, DE, 2001.
[17] Joachims, T., Text categorization with support vector machines: learning with many relevant features. Proceedings of ECML-98, 10th European Conference on Machine Learning, eds. C. Nédellec & C. Rouveirol, Springer Verlag, Heidelberg, DE: Chemnitz, DE, pp. 137–142, 1998. Published in the “Lecture Notes in Computer Science” series, number 1398.
[18] Joachims, T., Transductive inference for text classification using support vector machines. Proceedings of ICML-99, 16th International Conference on Machine Learning, eds. I. Bratko & S. Dzeroski, Morgan Kaufmann Publishers, San Francisco, US: Bled, SL, pp. 200–209, 1999.
[19] Drucker, H., Vapnik, V. & Wu, D., Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5), pp. 1048–1054, 1999.
[20] Dumais, S.T., Platt, J., Heckerman, D. & Sahami, M., Inductive learning algorithms and representations for text categorization. Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management, eds. G. Gardarin, J.C. French, N. Pissinou, K. Makki & L. Bouganim, ACM Press, New York, US: Bethesda, US, pp. 148–155, 1998.
[21] Dumais, S.T. & Chen, H., Hierarchical classification of Web content. Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval, eds. N.J. Belkin, P. Ingwersen & M.K. Leong, ACM Press, New York, US: Athens, GR, pp. 256–263, 2000.
[22] Vapnik, V.N., The nature of statistical learning theory. Springer Verlag: Heidelberg, DE, 1995.
[23] Brank, J., Grobelnik, M., Milić-Frayling, N. & Mladenić, D., Interaction of feature selection methods and linear classification models. Proceedings of the ICML-02 Workshop on Text Learning, Sydney, AU, 2002.
[24] Schapire, R.E., Singer, Y. & Singhal, A., Boosting and Rocchio applied to text filtering. Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, eds. W.B. Croft, A. Moffat, C.J.V. Rijsbergen, R. Wilkinson & J. Zobel, ACM Press, New York, US: Melbourne, AU, pp. 215–223, 1998.
[25] Schapire, R.E. & Singer, Y., BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3), pp. 135–168, 2000.
[26] Sebastiani, F., Sperduti, A. & Valdambrini, N., An improved boosting algorithm and its application to automated text categorization. Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management, eds. A. Agah, J. Callan & E. Rundensteiner, ACM Press, New York, US: McLean, US, pp. 78–85, 2000.
[27] Nardiello, P., Sebastiani, F. & Sperduti, A., Discretizing continuous attributes in AdaBoost for text categorization. Proceedings of ECIR-03, 25th European Conference on Information Retrieval, ed. F. Sebastiani, Springer Verlag: Pisa, IT, pp. 320–334, 2003.
[28] Chakrabarti, S., Dom, B.E. & Indyk, P., Enhanced hypertext categorization using hyperlinks. Proceedings of SIGMOD-98, ACM International Conference on Management of Data, eds. L.M. Haas & A. Tiwary, ACM Press, New York, US: Seattle, US, pp. 307–318, 1998.
[29] Oh, H.J., Myaeng, S.H. & Lee, M.H., A practical hypertext categorization method using links and incrementally available class information. Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval, eds. N.J. Belkin, P. Ingwersen & M.K. Leong, ACM Press, New York, US: Athens, GR, pp. 264–271, 2000.
[30] Slattery, S. & Craven, M., Discovering test set regularities in relational domains. Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Morgan Kaufmann Publishers, San Francisco, US: Stanford, US, pp. 895–902, 2000.
[31] Getoor, L., Segal, E., Taskar, B. & Koller, D., Probabilistic models of text and link structure for hypertext classification. Proceedings of the IJCAI-01 Workshop on Text Learning: Beyond Supervision, Seattle, US, pp. 24–29, 2001.
[32] Yang, Y., Slattery, S. & Ghani, R., A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2/3), pp. 219–241, 2002. Special Issue on Automated Text Categorization.
[33] Chakrabarti, S., Dom, B.E., Agrawal, R. & Raghavan, P., Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. Journal of Very Large Data Bases, 7(3), pp. 163–178, 1998.
[34] Koller, D. & Sahami, M., Hierarchically classifying documents using very few words. Proceedings of ICML-97, 14th International Conference on Machine Learning, ed. D.H. Fisher, Morgan Kaufmann Publishers, San Francisco, US: Nashville, US, pp. 170–178, 1997.
[35] Ng, H.T., Goh, W.B. & Low, K.L., Feature selection, perceptron learning, and a usability case study for text categorization. Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval, eds. N.J. Belkin, A.D. Narasimhalu & P. Willett, ACM Press, New York, US: Philadelphia, US, pp. 67–73, 1997.
[36] Gaussier, E., Goutte, C., Popat, K. & Chen, F., A hierarchical model for clustering and categorising documents. Proceedings of ECIR-02, 24th European Colloquium on Information Retrieval Research, eds. F. Crestani, M. Girolami & C.J.V. Rijsbergen, Springer Verlag, Heidelberg, DE: Glasgow, UK, pp. 229–247, 2002. Published in the “Lecture Notes in Computer Science” series, number 2291.
[37] McCallum, A.K., Rosenfeld, R., Mitchell, T.M. & Ng, A.Y., Improving text classification by shrinkage in a hierarchy of classes. Proceedings of ICML-98, 15th International Conference on Machine Learning, ed. J.W. Shavlik, Morgan Kaufmann Publishers, San Francisco, US: Madison, US, pp. 359–367, 1998.
[38] Toutanova, K., Chen, F., Popat, K. & Hofmann, T., Text classification in a hierarchical mixture model for small training sets. Proceedings of CIKM-01, 10th ACM International Conference on Information and Knowledge Management, eds. H. Paques, L. Liu & D. Grossman, ACM Press, New York, US: Atlanta, US, pp. 105–113, 2001.
[39] Vinokourov, A. & Girolami, M., A probabilistic framework for the hierarchic organisation and classification of document collections. Journal of Intelligent Information Systems, 18(2/3), pp. 153–172, 2002. Special Issue on Automated Text Categorization.
[40] Larkey, L.S., A patent search and classification system. Proceedings of DL-99, 4th ACM Conference on Digital Libraries, eds. E.A. Fox & N. Rowe, ACM Press, New York, US: Berkeley, US, pp. 179–187, 1999.
[41] Weiss, S.M., Apté, C., Damerau, F.J., Johnson, D.E., Oles, F.J., Goetz, T. & Hampp, T., Maximizing text-mining performance. IEEE Intelligent Systems, 14(4), pp. 63–69, 1999.
[42] Amati, G., D’Aloisi, D., Giannini, V. & Ubaldini, F., A framework for filtering news and managing distributed data. Journal of Universal Computer Science, 3(8), pp. 1007–1021, 1997.
[43] Chandrinos, K.V., Androutsopoulos, I., Paliouras, G. & Spyropoulos, C.D., Automatic Web rating: Filtering obscene content on the Web. Proceedings of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries, eds. J.L. Borbinha & T. Baker, Springer Verlag, Heidelberg, DE: Lisbon, PT, pp. 403–406, 2000. Published in the “Lecture Notes in Computer Science” series, number 1923.
[44] Escudero, G., Màrquez, L. & Rigau, G., Boosting applied to word sense disambiguation. Proceedings of ECML-00, 11th European Conference on Machine Learning, eds. R.L.D. Mántaras & E. Plaza, Springer Verlag, Heidelberg, DE: Barcelona, ES, pp. 129–141, 2000. Published in the “Lecture Notes in Computer Science” series, number 1810.
[45] Giorgetti, D. & Sebastiani, F., Automating survey coding by multiclass text categorization techniques. Journal of the American Society for Information Science and Technology, 2003. Forthcoming.
[46] Vel, O.Y.D., Anderson, A., Corney, M. & Mohay, G.M., Mining email content for author identification forensics. SIGMOD Record, 30(4), pp. 55–64, 2001.
[47] Forsyth, R.S., New directions in text categorization. Causal models and intelligent data management, ed. A. Gammerman, Springer Verlag: Heidelberg, DE, pp. 151–185, 1999.
[48] Diederich, J., Kindermann, J., Leopold, E. & Paaß, G., Authorship attribution with support vector machines. Applied Intelligence, 19(1/2), pp. 109–123, 2003.
[49] Finn, A., Kushmerick, N. & Smyth, B., Genre classification and domain transfer for information filtering. Proceedings of ECIR-02, 24th European Colloquium on Information Retrieval Research, eds. F. Crestani, M. Girolami & C.J.V. Rijsbergen, Springer Verlag, Heidelberg, DE: Glasgow, UK, pp. 353–362, 2002. Published in the “Lecture Notes in Computer Science” series, number 2291.
[50] Kessler, B., Nunberg, G. & Schütze, H., Automatic detection of text genre. Proceedings of ACL-97, 35th Annual Meeting of the Association for Computational Linguistics, eds. P.R. Cohen & W. Wahlster, Morgan Kaufmann Publishers, San Francisco, US: Madrid, ES, pp. 32–38, 1997.
[51] Lee, Y.B. & Myaeng, S.H., Text genre classification with genre-revealing and subject-revealing features. Proceedings of SIGIR-02, 25th ACM International Conference on Research and Development in Information Retrieval, eds. M. Beaulieu, R. Baeza-Yates, S.H. Myaeng & K. Järvelin, ACM Press, New York, US: Tampere, FI, pp. 145–150, 2002.
[52] Stamatatos, E., Fakotakis, N. & Kokkinakis, G., Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4), pp. 471–495, 2000.
[53] Androutsopoulos, I., Koutsias, J., Chandrinos, K.V. & Spyropoulos, C.D., An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval, eds. N.J. Belkin, P. Ingwersen & M.K. Leong, ACM Press, New York, US: Athens, GR, pp. 160–167, 2000.
[54] Gómez-Hidalgo, J.M., Evaluating cost-sensitive unsolicited bulk email categorization. Proceedings of SAC-02, 17th ACM Symposium on Applied Computing, Madrid, ES, pp. 615–620, 2002.
[55] Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C.D. & Stamatopoulos, P., A memory-based approach to anti-spam filtering for mailing lists. Information Retrieval, 6(1), pp. 49–73, 2003.
[56] Myers, K., Kearns, M., Singh, S. & Walker, M.A., A boosting approach to topic spotting on subdialogues. Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Morgan Kaufmann Publishers, San Francisco, US: Stanford, US, pp. 655–662, 2000.
[57] Sable, C.L. & Hatzivassiloglou, V., Text-based approaches for non-topical image categorization. International Journal of Digital Libraries, 3(3), pp. 261–275, 2000.
[58] Larkey, L.S., Automatic essay grading using text categorization techniques. Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, eds. W.B. Croft, A. Moffat, C.J.V. Rijsbergen, R. Wilkinson & J. Zobel, ACM Press, New York, US: Melbourne, AU, pp. 90–95, 1998.
[59] Li, X. & Roth, D., Learning question classifiers. Proceedings of COLING-02, 19th International Conference on Computational Linguistics, Taipei, TW, 2002.
[60] Koster, C.H. & Seutter, M., Taming wild phrases. Proceedings of ECIR-03, 25th European Conference on Information Retrieval, ed. F. Sebastiani, Springer Verlag: Pisa, IT, pp. 161–176, 2003.
[61] Yang, Y., A scalability analysis of classifiers in text categorization. Proceedings of SIGIR-03, 26th ACM International Conference on Research and Development in Information Retrieval, ACM Press, New York, US: Toronto, CA, 2003.
[62] Nigam, K., McCallum, A.K., Thrun, S. & Mitchell, T.M., Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), pp. 103–134, 2000.
[63] Sparck Jones, K. & Willett, P., (eds.) Readings in information retrieval. Morgan Kaufmann: San Mateo, US, 1997.

Source: http://lvk.cs.msu.su/~bruzz/articles/classification/text-categorization.pdf
