The LIVA research and development project (2005-2007) was conceived to integrate automatic indexing, automatic categorization, information visualization and information retrieval in library systems managing textual document collections. After a brief overview of some major information visualization methods, the user interface prototype is introduced.
The public sector is actively pursuing digital transformation to ensure continuous operations and relevance. While existing research has outlined essential prerequisites for successful digital transformation, there is recognition of willful ignorance concerning these prerequisites. Public servants may, in other words, deliberately avoid understanding the necessary conditions for digital transformation, often driven by strategic motives such as evading responsibility and/or accountability. The phenomenon of willful ignorance constitutes an important yet under-researched area within the study of digital government. To close this gap, we investigate the latent factors of willful ignorance in public sector digital transformation, utilizing three sets of national panel data focused on digital transformation prerequisites. Employing exploratory factor analysis on an initial sample, we construct a factor model, subsequently assessing its validity through confirmatory factor analysis on two additional samples. Our research identifies and validates latent factors associated with willful ignorance in the digital transformation of the public sector. Building on these findings, we propose a mid-range variance theory termed “digital transformation decoupling”. By integrating this theory with existing knowledge, we present a set of propositions to guide future research in the realm of public sector digital transformation.
In this thesis, text categorization is investigated along four dimensions of analysis: theoretically as well as empirically, and as a manual as well as a machine-based process. In the first four chapters we look at the theoretical foundation of subject classification of text documents, with a particular focus on classification as a procedure for organizing documents in libraries. A working hypothesis in the theoretical analysis is that classification of documents is a process that involves translations between statements in different languages, both natural and artificial. We further investigate the close relationships between structures in classification languages and the order relations and topological structures that arise from classification.
A classification algorithm that receives special focus in the subsequent chapters is the support vector machine (SVM), which in its original formulation is a binary classifier in linear vector spaces, but has been extended to handle classification problems for which the categories are not linearly separable. To this end the algorithm utilizes a category of functions called kernels, which induce feature spaces by means of high-dimensional and often non-linear maps. For the empirical part of this study we investigate the classification performance of semantic kernels generated by different measures of semantic similarity. One category of such measures is based on the latent semantic analysis and random indexing methods, which generate term vectors using co-occurrence data from text collections. Another semantic measure used in this study is pointwise mutual information. In addition to the empirical study of semantic kernels, we also investigate the performance of a term weighting scheme called divergence from randomness, which has hitherto received little attention within the area of automatic text categorization.
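To make the kernel construction concrete, the following minimal sketch shows how a semantic kernel of the form K(x, y) = (xP)(yP)ᵀ, induced by a matrix P of term vectors from, e.g., latent semantic analysis or random indexing, could be plugged into scikit-learn's SVM. The data, dimensions, and the random stand-in for P are illustrative assumptions only.

```python
import numpy as np
from sklearn.svm import SVC

# P holds one vector per term, e.g. from latent semantic analysis
# (truncated SVD) or random indexing. It induces the feature map
# phi(x) = x @ P and hence the kernel K(x, y) = (x @ P) @ (y @ P).T.
def make_semantic_kernel(P):
    def kernel(X, Y):
        return (X @ P) @ (Y @ P).T
    return kernel

rng = np.random.default_rng(0)
X_train = rng.random((20, 100))    # toy document-term weight vectors
y_train = rng.integers(0, 2, 20)   # toy binary category labels
P = rng.random((100, 10))          # stand-in for real LSA/RI term vectors

clf = SVC(kernel=make_semantic_kernel(P))  # SVC accepts a callable kernel
clf.fit(X_train, y_train)
print(clf.predict(X_train[:3]))
```

Note that when P is the identity matrix the construction reduces to the standard linear kernel, which is the baseline the semantic kernels are compared against.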
The results of the empirical part of this study show that the semantic kernels generally outperform the “standard” (non-semantic) linear kernel, especially for small training sets. A conclusion that can be drawn with respect to the investigated datasets is therefore that semantic information in the kernel generally improves classification performance, and that the difference between the standard kernel and the semantic kernels is particularly large for small training sets. Another clear trend in the results is that the divergence from randomness weighting scheme yields a classification performance surpassing that of the common tf-idf weighting scheme.
In this study, we investigate different strategies for assigning MeSH (Medical Subject Headings) terms to clinical guidelines using machine learning. Features based on words in titles and abstracts are investigated and compared to features based on topics assigned to references cited by the guidelines. Two of the feature engineering strategies utilize word embeddings produced by recent models based on the distributional hypothesis, called word2vec and fastText. The evaluation results show that reference-based strategies tend to yield higher recall and F1 scores for MeSH terms with a sufficient number of training instances, whereas title- and abstract-based features yield higher precision.
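As a hedged illustration of the embedding-based feature strategies, the sketch below trains a small fastText model with gensim and averages token vectors into one feature vector per guideline; the toy corpus, dimensions, and the averaging scheme are assumptions for illustration, not the study's exact configuration.

```python
import numpy as np
from gensim.models import FastText

# Toy corpus standing in for tokenized guideline titles/abstracts; the
# study's embeddings would be trained on far larger text collections.
sentences = [
    ["treatment", "of", "type", "2", "diabetes"],
    ["screening", "guidelines", "for", "breast", "cancer"],
]
model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=10)

def doc_vector(tokens, model):
    """Average the token embeddings into one document feature vector."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

features = doc_vector(["diabetes", "screening"], model)
print(features.shape)  # (50,) -> input to a MeSH-term classifier
```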
In this study, we have started to investigate how distinguishing the role of the cited reference from the subject of the cited reference can facilitate a more nuanced way of evaluating the citation context in the referring paper. Using natural language processing, we have developed methods to both enrich and distinguish specific traits in the aggregated citances. In future work we intend to extend the present analysis to a larger set of publications from the corpus and to cover more disciplines in order to evaluate the results more precisely.
In this research-in-progress paper we report preliminary results from a proposed novel use of topic modelling in which bibliographic references, rather than actual text content, serve as the sources for “bags-of-words” in scientometric settings. The actual cited references, viewed as concept symbols for paradigmatic approaches to earlier research, could thereby be used to cluster research. We will demonstrate an explorative approach to using cited-reference topics for the discovery of hidden semantic reference structures in a set of scientific articles. If found fruitful and robust, this approach could complement existing text-based and citation-based clustering techniques and might bridge the two approaches. By approaching references as “words” and reference lists as “sentences” (or documents) of such “words”, we demonstrate that the topical structure of document collections can also be analyzed using an alternative and complementary source of content, which additionally provides an interesting perspective on bibliographic references as units of a meta-language describing document content.
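A minimal sketch of the idea, using gensim's LDA with reference lists as “documents” and cited-reference identifiers as “words”; the DOIs and the topic count are illustrative assumptions.

```python
from gensim import corpora
from gensim.models import LdaModel

# Each "document" is one article's reference list; each "word" is a
# cited reference identifier (hypothetical DOIs here).
ref_lists = [
    ["10.1000/a", "10.1000/b", "10.1000/c"],
    ["10.1000/b", "10.1000/c", "10.1000/d"],
    ["10.1000/e", "10.1000/f", "10.1000/e"],
]
dictionary = corpora.Dictionary(ref_lists)
corpus = [dictionary.doc2bow(refs) for refs in ref_lists]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               passes=10, random_state=0)
for topic_id, refs in lda.show_topics(num_words=3, formatted=False):
    print(topic_id, [(ref, round(w, 2)) for ref, w in refs])
```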
This work in progress aims to perform a comprehensive analysis of digital data publicly available from the Swedish parliament, such as motions, interpellations, and protocols from discussions in the plenum. To this end, we will use methods based on computational linguistics, such as topic modeling, word embeddings, and sentiment analysis, to identify prominent discourses corresponding to co-occurrence patterns of the words used. Using data from the early 1970s and onward, this project will involve a chronological examination of semantic and discursive changes with regard to topics such as equality, neutrality, the EU, the monarchy, immigration, climate change, and cultural policy. One research question that will be investigated in relation to this analysis is to what extent discursive changes can be detected within specific political parties, and what historical and political reasons can be posited to underlie these changes. Another research question focuses on scientometric aspects of how scholarly research is used to support claims made in the political discussion. With regard to this question, we will more specifically investigate the conceptual aspects of the texts surrounding citations (so-called citances), which will be mined for information such as what research is referred to in terms of individuals, position, disciplinary affiliations, and active research topics. In these citances, the significant content is often present as latent references that require further elucidation, together with an analysis of the sentiments expressed in the argumentation. This analysis will be further enhanced by investigating the usage of hedge terms that may indicate a level of uncertainty about the cited research.
In this study, we demonstrate how to collect Twitter conversations emanating from or referring to scientific papers. We propose segmenting the conversational threads into smaller segments and then comparing them using information retrieval techniques, in order to find differences and similarities between and within discussions. While the method can still be improved, the study shows that it is possible to collect larger conversations about research on Twitter, and that these are suitable for various automated methods. We do, however, identify a need to analyse these conversations with qualitative methods as well.
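As a sketch of the comparison step, segments (here invented one-line stand-ins for concatenated tweets) could be vectorised with tf-idf and compared pairwise by cosine similarity; the study does not specify its exact retrieval model, so this is one plausible instantiation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical conversation segments (concatenated tweets per segment).
segments = [
    "interesting new paper on citation analysis",
    "the methodology of the paper seems questionable",
    "large scale citation analysis with great results",
]
tfidf = TfidfVectorizer().fit_transform(segments)
sim = cosine_similarity(tfidf)   # pairwise segment-to-segment similarity
print(sim.round(2))
```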
This article explores the broad and undefined research field of the social impact of the arts. The effects of art and culture are often used as justification for public funding, but the research on these interventions and their effects is unclear. Using a co-word analysis of over 10,000 articles published between 1990 and 2020, we examined the characteristics of the field as we have operationalised it through our searches. We found that since 2015 this research field has expanded and consists of different epistemologies and methodologies, summarised in largely overlapping subfields belonging to the social sciences, humanities, arts education, and arts and health/therapy. In formal or informal learning settings, studies of theatre/drama as an intervention to enhance skills, well-being, or knowledge among children are most common in our corpus. A study of the research front through the bibliographic coupling of the most cited articles in the corpus confirmed the co-word analysis and revealed new themes that together form the ground for insight into research on the social impact of the arts. This article can therefore inform discussions on the social value of culture and the arts.
In the search to secure funding, researchers must now respond to requests by governments and non-government organisations about how to measure the societal and professional impact of their research. While case studies and reports of interventions may provide grounds for qualitative evaluation, bibliometric methodology is emerging as an important quantitative supplement to these evaluations.
In clinical practice, treatment recommendations and clinical guidelines provide traces of clinical and professional practice that can be used to identify and measure research impact. To understand how these traces emerge, the research reported here explores documents issued by the three main Swedish agencies that produce recommendations for clinical practice. In particular, it examines the cited references within the documents to explore size distribution, reference age, and geographical aspects, in addition to the similarities of the cited-reference structure between the producers of the documents.
The overall goal of this ongoing project is to gain insights into citation practice and the distribution of publications in professional practice, in order to provide grounds for developing indicators of clinical impact. Future applications with regard to the broader area of professional impact, based on references found in the literature of a wide range of professions (e.g. the health sector, social welfare, engineering, and the environmental realm), are considered.
This study explores integrating generative AI to enhance citation context typing. Using the Claude LLM, we generate synthetic data aligned with the Citation Typing Ontology (CiTO) to train a classifier. This supervised learning experiment involves training a classifier to identify citation types using the synthetic data. We evaluate the classifier’s performance on uncategorised citation statements. Additionally, we extend our analysis to test the classifier, trained on English-language citation context statements, on statements extracted from Swedish and German research publications. A novel aspect of this work lies in the fusion of bibliometrics and experimental work in semantic modelling, employing language models to train machine learning models for research content evaluation. While acknowledging the inherent limitations of machine learning algorithms, we propose further testing using real-time scenarios and human evaluators. This study aims to push the boundaries of research methodology by integrating generative AI beyond text generation into the research process itself.
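A minimal sketch of the synthetic-data step using the Anthropic Python SDK; the model name, prompt wording, and the selected CiTO properties are illustrative assumptions rather than the study's actual setup.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A few CiTO object properties to target; the study's full label set may differ.
cito_types = ["cito:usesMethodIn", "cito:disagreesWith", "cito:extends"]

prompt = (
    "Write one sentence from a research article that cites another work "
    f"in a way matching the CiTO property {cito_types[0]}. "
    "Mark the citation as [CIT]."
)
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed model identifier
    max_tokens=200,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)  # one synthetic, CiTO-labelled training example
```

Statements generated and labelled this way can then be fed to any standard text classifier as training data.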
Adding the semantic content of texts to the study of citations opens up new means of research in the field. Words can be used in specific or more general senses. Their meaning changes through use. Correspondingly, the meaning of a cited reference is defined by its use. Furthermore, the meaning of the reference changes as it is used in different contexts. Using word embeddings, we create a conceptual space of references from a window of text around each reference. The model is trained on a set of 2 million full-text articles derived from EuroPMC. We measure the length of the journey of the cited references in this space to determine how much their semantic meaning changes over time. Furthermore, we study the topical heterogeneity of the citation contexts attributed to the references by the citing documents.
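One way to operationalise this measurement, sketched below under simplifying assumptions: the cited reference is injected as a pseudo-token into its citation-context windows, a single word2vec space is trained over all contexts, and the reference's “journey” is approximated as the cosine distance between the centroids of its context words in different time periods. The corpus, token names, and periods are invented for illustration.

```python
import numpy as np
from gensim.models import Word2Vec

# Citation contexts: token windows around a cited reference, with the
# reference injected as a pseudo-token ("REF:smith2001"), split by period.
contexts_by_period = {
    "2005-2009": [["gene", "expression", "REF:smith2001", "analysis"]],
    "2015-2019": [["deep", "learning", "REF:smith2001", "benchmark"]],
}
all_contexts = [c for cs in contexts_by_period.values() for c in cs]
model = Word2Vec(all_contexts, vector_size=50, window=5, min_count=1, epochs=50)

def period_centroid(contexts, model, ref):
    """Mean vector of the words co-occurring with `ref` in one period."""
    words = [w for c in contexts for w in c if w != ref and w in model.wv]
    return np.mean([model.wv[w] for w in words], axis=0)

a = period_centroid(contexts_by_period["2005-2009"], model, "REF:smith2001")
b = period_centroid(contexts_by_period["2015-2019"], model, "REF:smith2001")
drift = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"semantic drift (cosine distance): {drift:.2f}")
```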
In this explorative work we investigate to what degree the semantic meaning of a cited reference can be recognized. In the end, we explore the possibility of generating a dynamic classification of research based on its use rather than its content. This would make it possible to identify similar works irrespective of manifest citation links (bibliographic coupling or co-citation) or identical word content (co-word analysis).
Since many databases lack relevance ranking, a citation-based approach can be a valuable complement, as citation-based data can indicate centrality, relevance, or visibility in the research community. However, applying bibliometric methods in the humanities is often challenging, since much of the research literature is not indexed in the traditional citation databases generally used for bibliometric mapping.
We introduce a combined bibliometric and semantic approach that extends a network of bibliographic records by incorporating, on the basis of the semantic similarities between their titles, a larger set of records that lack bibliometric features. In order to expand the set of identified relevant articles, we used the Universal Sentence Encoder (USE) algorithm developed by Google Research to generate semantic vectors for the titles.
We searched several different databases, some of which include citation data, to create a pool C of candidate documents within the selected subject area. A set A of documents was obtained from a citation database to generate the initial network of articles. We then calculated the bibliographic coupling of the articles, quantified by their shared references.
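Bibliographic coupling itself reduces to counting shared references; a minimal sketch with hypothetical document and reference identifiers:

```python
from itertools import combinations

# Hypothetical reference sets for three documents in A.
refs = {
    "doc1": {"r1", "r2", "r3"},
    "doc2": {"r2", "r3", "r4"},
    "doc3": {"r5"},
}
# Coupling strength = number of cited references two documents share.
coupling = {
    (a, b): len(refs[a] & refs[b])
    for a, b in combinations(refs, 2)
    if refs[a] & refs[b]
}
print(coupling)  # {('doc1', 'doc2'): 2}
```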
We manually selected a small set S1 ⊂ A of documents representing different topical clusters as a seed for the expansion based on semantic similarities. For each document d ∈ S1, we ranked the documents in C in ascending order of their cosine distance to the title vector assigned to d, and then selected the k documents closest to d. This procedure gave us a set S2 ⊂ C of documents to read.
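The expansion step can be sketched as follows with the publicly available USE v4 module from TensorFlow Hub; the titles and the value of k are invented for illustration.

```python
import numpy as np
import tensorflow_hub as hub

# Load the public Universal Sentence Encoder (v4) from TensorFlow Hub.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

seed_title = "Social impact of participatory theatre programmes"
candidate_titles = [                      # stand-ins for titles in pool C
    "Drama interventions and wellbeing in schools",
    "A survey of graph neural networks",
    "Community arts and social cohesion",
]
vecs = use([seed_title] + candidate_titles).numpy()
seed, cands = vecs[0], vecs[1:]

# Cosine distance from each candidate title to the seed title, ascending.
dists = 1 - cands @ seed / (np.linalg.norm(cands, axis=1) * np.linalg.norm(seed))
k = 2
closest = np.argsort(dists)[:k]           # the k documents closest to d
print([candidate_titles[i] for i in closest])
```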
The results were evaluated using qualitative analysis to determine whether they were thematically relevant to the present information needs.
Recent media discussion on “fake news” underlines the importance of evidence-based decision-making. To gather, analyze and interpret “facts” is, however, in our information-dense digital times, not always easy. Activities such as information seeking, knowledge building and evaluation in scholarly practice are often performed using bibliometric/informetric methods. The increased interest in bibliometrics also opens up new questions on how data sources are being used and what kinds of challenges and/or possibilities warrant further investigation. In this session for interaction and engagement, we invite participants to explore both means of analyzing already available data sources using machine-learning technology, and the inclusion of new sets of data that could augment the different views of grasping research activities using algorithms. Such data could be content-intensive (as text), time-sensitive (as events), contextual (in terms of links between different properties) or multi-modal, meaning that other sources, such as imagery, sound, and video – even material objects – may constitute possible contributions as data and as impact.
Our system for the CoNLL 2008 shared task uses a set of individual parsers, a set of stand-alone semantic role labellers, and a joint system for parsing and semantic role labelling, all blended together. The system achieved a macro averaged labelled F1-score of 79.79 (WSJ 80.92, Brown 70.49) for the overall task. The labelled attachment score for syntactic dependencies was 86.63 (WSJ 87.36, Brown 80.77) and the labelled F1-score for semantic dependencies was 72.94 (WSJ 74.47, Brown 60.18).