A comparison of different strategies for automated semantic document annotation

Große-Bölting, Gregor; Nishioka, Chifumi; Scherp, Ansgar

doi:10.1145/2815833.2815838

Please use this identifier to cite or link to this item: http://hdl.handle.net/1893/28054

Appears in Collections:	Computing Science and Mathematics Conference Papers and Proceedings
Author(s):	Große-Bölting, Gregor; Nishioka, Chifumi; Scherp, Ansgar
Contact Email:	ansgar.scherp@stir.ac.uk
Title:	A comparison of different strategies for automated semantic document annotation
Citation:	Große-Bölting G, Nishioka C & Scherp A (2015) A comparison of different strategies for automated semantic document annotation. In: Proceedings of the 8th International Conference on Knowledge Capture (K-Cap 2015). 8th International Conference on Knowledge Capture (K-Cap '15), Palisades, NY, USA, 07.10.2015-10.10.2015. New York: ACM. https://doi.org/10.1145/2815833.2815838
Issue Date:	31-Dec-2015
Date Deposited:	22-Oct-2018
Conference Name:	8th International Conference on Knowledge Capture (K-Cap '15)
Conference Dates:	2015-10-07 - 2015-10-10
Conference Location:	Palisades, NY, USA
Abstract:	We introduce a framework for automated semantic document annotation that is composed of four processes: concept extraction, concept activation, annotation selection, and evaluation. The framework is used to implement and compare different annotation strategies motivated by the literature. For concept extraction, we apply entity detection with semantic hierarchical knowledge bases, Tri-gram, RAKE, and LDA. For concept activation, we compare a set of statistical, hierarchy-based, and graph-based methods. For selecting annotations, we compare top-k and kNN. In total, we define 43 different strategies, including novel combinations such as graph-based activation with kNN. We evaluated the strategies on three datasets of varying size from three scientific disciplines (economics, politics, and computer science) that together contain 100,000 manually labelled documents. We obtain the best results on all three datasets with our novel combination of entity detection, graph-based activation (e.g., HITS and Degree), and kNN. For the economics and political science datasets, the best F-measures are .39 and .28, respectively. For the computer science dataset, the maximum F-measure is .33. These are by far the largest experiments on scholarly content annotation; such experiments typically use only up to a few hundred documents per dataset.
Status:	VoR - Version of Record
Rights:	The publisher does not allow this work to be made publicly available in this Repository. Please use the Request a Copy feature at the foot of the Repository record to request a copy directly from the author. You can only request a copy if you wish to use this work for your own research or private study.
Licence URL(s):	http://www.rioxx.net/licenses/under-embargo-all-rights-reserved
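
Illustration (not part of the deposited record): the abstract names top-k annotation selection and an F-measure evaluation. The following is a minimal sketch of those two steps, assuming each candidate concept already carries an activation score. The function names, the example scores, and the per-document set-based F-measure are assumptions made for illustration; they are not taken from the paper, whose implementation and averaging scheme may differ.

```python
def select_top_k(scores, k):
    """Return the k concepts with the highest activation score.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [concept for concept, _ in ranked[:k]]


def f_measure(predicted, gold):
    """Harmonic mean of precision and recall for one document."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical activation scores for one economics document.
scores = {"inflation": 0.9, "monetary policy": 0.7, "banks": 0.4, "trade": 0.1}
gold = ["inflation", "banks", "monetary policy"]  # hypothetical manual labels

annotations = select_top_k(scores, k=2)
print(annotations)                             # ['inflation', 'monetary policy']
print(round(f_measure(annotations, gold), 2))  # 0.8
```

With k=2, both selected concepts are correct (precision 1.0) but one gold label is missed (recall 2/3), which gives an F-measure of 0.8 for this toy document.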

Files in This Item:

File	Description	Size	Format	Access
Grosse-Bolting-etal-CP.pdf	Fulltext - Published Version	280.94 kB	Adobe PDF	Under Permanent Embargo

This item is protected by original copyright

Items in the Repository are protected by copyright, with all rights reserved, unless otherwise indicated.

The metadata of the records in the Repository are available under the CC0 public domain dedication: No Rights Reserved https://creativecommons.org/publicdomain/zero/1.0/

If you believe that any material held in STORRE infringes copyright, please contact library@stir.ac.uk providing details and we will remove the Work from public display in STORRE and investigate your claim.
