Classification of Fiction Genres: Text classification of fiction texts from Project Gutenberg
Högskolan i Borås, Akademin för bibliotek, information, pedagogik och IT.
2018 (English). Independent thesis, advanced level (master's degree), 20 credits / 30 HE credits. Student thesis (degree project).
Abstract [en]

Stylometric analysis in text classification is most often used in authorship attribution studies. This thesis used a machine learning algorithm, the Naive Bayes classifier, in a text classification task comparing stylometric and lexical features. The texts were extracted from the Project Gutenberg website and comprised three genres: detective fiction, fantasy, and science fiction. The aim was to see how well the classifier performed in a supervised learning task at discerning the genres from one another. R was used to extract the texts from Project Gutenberg, and a Python script was used to run the experiment. Approximately 1978 texts were extracted and preprocessed. Univariate filtering and tf-idf weighting were used for the lexical feature set, while average sentence length, average word length, number of characters, number of punctuation marks, number of uppercase words, number of title case words, and part-of-speech tags for nouns, verbs, and adjectives were generated as the feature set for the topic-independent stylometric features. Normalization was performed using the ℓ² norm for the tf-idf weighting, and using the ℓ² norm and z-score standardization for the stylometric features. Multinomial Naive Bayes was run on the lexical feature set and Gaussian Naive Bayes on the stylometric set, both with 10-fold cross-validation. Precision was used as the measure by which to assess the performance of the classifier. The classifier performed better in the lexical features experiment than in the stylometric features experiment, suggesting that downsampling, more stylometric features, and more classes would have been beneficial.
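
The surface-level stylometric features listed in the abstract, together with the two normalization steps it names, can be illustrated with a minimal standard-library Python sketch. This is not the thesis's own script: the function names, the regex-based sentence and word splitting, and the choice to ignore one-letter words when counting uppercase words are all assumptions made for illustration. The part-of-speech counts are left out because they need an external tagger.

```python
import math
import re
import string


def stylometric_features(text):
    """Compute simple, topic-independent style features for one text.

    Hypothetical helper: tokenization rules here are illustrative assumptions.
    """
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a word is a run of ASCII letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    return {
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "n_characters": len(text),
        "n_punctuation": sum(1 for ch in text if ch in string.punctuation),
        # Assumption: one-letter words such as "I" or "A" are not counted.
        "n_uppercase_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "n_titlecase_words": sum(1 for w in words if w.istitle()),
    }


def z_scores(values):
    """Standardize one feature column to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std else 0.0 for v in values]


def l2_normalize(vector):
    """Scale one feature vector to unit Euclidean (l2) length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm if norm else 0.0 for v in vector]


if __name__ == "__main__":
    sample = "The DNA test failed. Holmes smiled!"
    print(stylometric_features(sample))
```

In a setup like the one described, z-score standardization would typically be applied per feature across all texts, and the ℓ² norm per text across its features. The abstract does not state the order in which the two were combined, so that detail remains open here.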

Place, publisher, year, edition, pages
2018.
Keywords [sv]
text, classification, genre, machine, learning, supervised, Gutenberg, fiction
National subject category
Library and Information Science
Identifiers
URN: urn:nbn:se:hb:diva-16007; OAI: oai:DiVA.org:hb-16007; DiVA, id: diva2:1305700
Subject / course
Library and Information Science
Available from: 2019-04-24. Created: 2019-04-18. Last updated: 2025-09-24. Bibliographically approved.

Open Access in DiVA

fulltext (337 kB), 4388 downloads
File information
File name: FULLTEXT01.pdf; File size: 337 kB; Checksum: SHA-512
c1fb6def37ac643542b4fd56386288f898d3dd9ad608bd1428f8904664f1afa7dbfcc240a2bb64168c8ae147982d34ab13c0e4b786f4afa0aa0c70ca293c8409
Type: fulltext; Mimetype: application/pdf

By organisation
Akademin för bibliotek, information, pedagogik och IT
Library and Information Science

Total: 4398 downloads
The number of downloads is the sum of downloads for all full texts. It may include, for example, earlier versions that are no longer available.

Total: 2452 hits