Automatisk genreklassifikation: en experimentell studie
University of Borås, Swedish School of Library and Information Science.
2008 (Swedish)
Independent thesis, Advanced level (degree of Master (One Year)), Student thesis
Alternative title
Automatic genre classification : an experimental study (English)
Abstract [en]

This thesis examines to what extent a small number of document features that are very easy to extract algorithmically can be used to classify electronic documents by genre. A set of experiments is carried out using only 11 such simple features to classify 84 documents from electronic academic journals into three manually identified genres: table of contents, article, and review. The 11 features are divided into three sets, containing metrics of words and sentences; punctuation marks; and URL links, respectively. Performance with these feature sets is measured as classification accuracy, using a k-NN classifier, four different values of k (1, 3, 5, 7), and both leave-one-out and 10-fold cross-validation. The best results are achieved when using all three feature sets (i.e. all 11 features) and k=3, with an overall accuracy of 96% (81 of the 84 documents correctly classified), regardless of the cross-validation method. These results are significantly better than those of a reference baseline, defined as the case where every instance is guessed to belong to the most populated class, which gives an accuracy of 49%. While the results are not disappointing in any way, the author views them as perhaps reflecting a somewhat easy classification task. He therefore concludes by advocating further research on how far very simple features can contribute to accurate automatic genre classification, preferably in experimental settings better suited to shedding light on this question.
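The evaluation protocol described in the abstract (k-NN, k in 1, 3, 5, 7, leave-one-out and 10-fold cross-validation, majority-class baseline) can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the Euclidean distance, the fold assignment, and the two-feature toy data are all assumptions made for illustration. Only the total of 84 documents and the largest class size of 41 (which follows from the reported 49% baseline) come from the abstract; the split of the remaining 43 documents and the neutral labels A, B, C are hypothetical. The toy data is trivially separable, so the accuracies it prints say nothing about the thesis results.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    # Euclidean distance is an assumption; the abstract does not name a metric.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(X, y, k, n_folds):
    """Accuracy of k-NN under n_folds-fold cross-validation.

    With n_folds == len(X), every fold holds one document, which is
    leave-one-out cross-validation.
    """
    # Interleaved fold assignment (every n_folds-th index) is an assumption.
    folds = [list(range(i, len(X), n_folds)) for i in range(n_folds)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        correct += sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx
        )
    return correct / len(X)


def majority_baseline(y):
    """Accuracy obtained by always guessing the most populated class."""
    return Counter(y).most_common(1)[0][1] / len(y)


def make_toy_data(seed=0):
    """Hypothetical, well-separated two-feature data for 84 documents."""
    rng = random.Random(seed)
    # 41 follows from the 49% baseline; 22 and 21 are made up.
    sizes = {"A": 41, "B": 22, "C": 21}
    centres = {"A": (0.0, 0.0), "B": (100.0, 0.0), "C": (0.0, 100.0)}
    X, y = [], []
    for label, n in sizes.items():
        cx, cy = centres[label]
        for _ in range(n):
            X.append([cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)])
            y.append(label)
    return X, y


X, y = make_toy_data()
print(f"majority baseline: {majority_baseline(y):.2f}")
for k in (1, 3, 5, 7):
    loo = cross_val_accuracy(X, y, k, n_folds=len(X))
    tenfold = cross_val_accuracy(X, y, k, n_folds=10)
    print(f"k={k}: leave-one-out={loo:.2f}, 10-fold={tenfold:.2f}")
```

With real data, each row of X would hold the 11 extracted feature values for one document. Since k-NN is sensitive to feature scale, some normalisation would probably be needed; the abstract does not say whether any was applied.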

Place, publisher, year, edition, pages
University of Borås/Swedish School of Library and Information Science (SSLIS), 2008.
Series
Magisteruppsats i biblioteks- och informationsvetenskap vid institutionen Biblioteks- och informationsvetenskap, ISSN 1654-0247 ; 2008:36
Keywords [en]
automatic genre classification, genre, document genre, automatic classification, information retrieval, IR systems, machine learning
National Category
Social Sciences
Identifiers
URN: urn:nbn:se:hb:diva-18835
Local ID: 2320/3582
OAI: oai:DiVA.org:hb-18835
DiVA, id: diva2:1310769
Note
Thesis level: D
Available from: 2019-04-30 Created: 2019-04-30

Open Access in DiVA

Full text (1523 kB), 4 downloads
File information
File name: FULLTEXT01.pdf
File size: 1523 kB
Checksum (SHA-512): 726f30444739f665a7aa405f1f3a5c726ecb1037d394b26eb9d1b92e4e860fdbc87f1615e5879d0a90af4feaa2398b06c7725947dc04c58c3e9d363e54952fad
Type: fulltext
Mimetype: application/pdf
