With or without context: Automatic text categorization using semantic kernels
Högskolan i Borås, Akademin för bibliotek, information, pedagogik och IT.
2016 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

This thesis investigates text categorization along four dimensions of analysis: theoretically as well as empirically, and as a manual as well as a machine-based process. In the first four chapters we look at the theoretical foundation of subject classification of text documents, with particular focus on classification as a procedure for organizing documents in libraries. A working hypothesis used in the theoretical analysis is that classification of documents is a process that involves translations between statements in different languages, both natural and artificial. We further investigate the close relationships between structures in classification languages and the order relations and topological structures that arise from classification.

A classification algorithm that receives special focus in the subsequent chapters is the support vector machine (SVM), which in its original formulation is a binary classifier in linear vector spaces, but has been extended to handle classification problems for which the categories are not linearly separable. To this end the algorithm utilizes a category of functions called kernels, which induce feature spaces by means of high-dimensional and often non-linear maps. For the empirical part of this study we investigate the classification performance of semantic kernels generated by different measures of semantic similarity. One category of such measures is based on the latent semantic analysis and random indexing methods, which generate term vectors by using co-occurrence data from text collections. Another semantic measure used in this study is pointwise mutual information. In addition to the empirical study of semantic kernels we also investigate the performance of a term weighting scheme called divergence from randomness, which has hitherto received little attention within the area of automatic text categorization.
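The idea of a semantic kernel can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the toy corpus, the function names, the document-level co-occurrence counts, the clipping of negative pointwise mutual information (PMI) values to zero, and the unit diagonal of the similarity matrix are all assumptions made for illustration. The sketch builds a term similarity matrix S from PMI and compares the linear kernel x·y with a semantic kernel of the form (Sx)·(Sy).

```python
import math
from collections import Counter
from itertools import combinations


def pmi_matrix(docs, vocab):
    """Term similarity matrix from positive PMI over document co-occurrence.

    Assumptions for this sketch: negative PMI is clipped to 0 and every
    term gets self-similarity 1.0 on the diagonal.
    """
    n_docs = len(docs)
    index = {term: i for i, term in enumerate(vocab)}
    df = Counter()  # number of documents containing each term
    co = Counter()  # number of documents containing each term pair
    for doc in docs:
        terms = sorted({t for t in doc if t in index})
        df.update(terms)
        co.update(combinations(terms, 2))
    size = len(vocab)
    S = [[0.0] * size for _ in range(size)]
    for i in range(size):
        S[i][i] = 1.0
    for (a, b), n_ab in co.items():
        # PMI(a, b) = log( P(a, b) / (P(a) P(b)) ) with document-level estimates
        pmi = math.log(n_ab * n_docs / (df[a] * df[b]))
        i, j = index[a], index[b]
        S[i][j] = S[j][i] = max(pmi, 0.0)
    return S


def bow_vector(doc, vocab):
    """Raw term-frequency vector of a tokenized document."""
    counts = Counter(doc)
    return [float(counts[t]) for t in vocab]


def smooth(x, S):
    """Map a term vector x to S x, spreading weight to related terms."""
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]


def linear_kernel(x, y):
    """Standard (non-semantic) kernel: the plain dot product."""
    return sum(a * b for a, b in zip(x, y))


def semantic_kernel(x, y, S):
    """k(x, y) = (S x) . (S y), positive semi-definite by construction."""
    return linear_kernel(smooth(x, S), smooth(y, S))


# Hypothetical toy corpus of tokenized documents.
corpus = [
    ["car", "automobile"],
    ["car", "automobile", "road"],
    ["library", "book"],
    ["library", "book", "shelf"],
]
vocab = ["car", "automobile", "road", "library", "book", "shelf"]
S = pmi_matrix(corpus, vocab)

x = bow_vector(["car"], vocab)
y = bow_vector(["automobile"], vocab)
z = bow_vector(["library"], vocab)

print("linear   k(car, automobile):", linear_kernel(x, y))
print("semantic k(car, automobile):", semantic_kernel(x, y, S))
print("semantic k(car, library):   ", semantic_kernel(x, z, S))
```

The two single-term documents share no terms, so the linear kernel treats them as unrelated, while the semantic kernel assigns them a positive similarity because their terms co-occur in the corpus. Writing the kernel as a dot product of smoothed vectors, rather than plugging the PMI matrix in directly, keeps it a valid kernel for an SVM even though a PMI matrix on its own need not be positive semi-definite.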

The results of the empirical part of this study show that the semantic kernels generally outperform the “standard” (non-semantic) linear kernel, especially for small training sets. A conclusion that can be drawn with respect to the investigated datasets is therefore that semantic information in the kernel generally improves its classification performance, and that the difference between the standard kernel and the semantic kernels is particularly large for small training sets. Another clear trend in the results is that the divergence from randomness weighting scheme yields a classification performance surpassing that of the common tf-idf weighting scheme.

Place, publisher, year, edition, pages
Högskolan i Borås, 2016, p. 300
Series
Skrifter från Valfrid, ISSN 1103-6990; 60
Keywords [en]
automatic text categorization, subject classification, machine learning, computational linguistics, support vector machines, semantic kernels, term weighting, divergence from randomness
HSV category
Research programme
Library and Information Science
Identifiers
URN: urn:nbn:se:hb:diva-8949
ISBN: 978-91-981654-8-7 (print)
ISBN: 978-91-981654-9-4 (print)
OAI: oai:DiVA.org:hb-8949
DiVA, id: diva2:906045
Public defence
2016-04-15, C203, Allégatan 1, Borås, 13:00
Available from: 2016-02-24 Created: 2016-02-23 Last updated: 2025-09-24 Bibliographically approved

Open Access in DiVA

cover (4709 kB), 347 downloads
File information
File: COVER01.pdf; File size: 4709 kB; Checksum: SHA-512
160e886aa5624f1a63d49a95f9b9689d0bba14dd5ef84bfec6fb954e7e5cf86356663cb1997bbe0ca4366c5e987da625688b7f7f47bda7b71fcbe27f1f6b273f
Type: cover; Mimetype: application/pdf
spikblad (45 kB), 197 downloads
File information
File: SPIKBLAD01.pdf; File size: 45 kB; Checksum: SHA-512
178c1f93766219a7641c0a436fad95571bb588f763a46c563fc805ee799a6f6357ddc133134aaff1e29525c64184523d0c8b3decdddacde036894efd323665ea
Type: spikblad; Mimetype: application/pdf
fulltext (1388 kB), 3060 downloads
File information
File: FULLTEXT01.pdf; File size: 1388 kB; Checksum: SHA-512
6f124006cafb01bbe558c3b5c08f96460c1478e68c19369a49f77c67756b16f7ae5955a3cdb0dd898d82e94a0ff75303562c72edd09bea0e220cdb9636a31a9a
Type: fulltext; Mimetype: application/pdf

Person

Eklund, Johan

Total: 3062 downloads
The number of downloads is the sum of all downloads of all full texts. It may include, for example, earlier versions that are no longer available.

Total: 15311 hits