Automated extraction of data from text using an XML parser: An earth science example using fossil descriptions

Curry, Gordon B; Connor, Richard C H

doi:10.1130/GES00140.1

Please use this identifier to cite or link to this item: http://hdl.handle.net/1893/27711

Full metadata record

DC Field	Value	Language
dc.contributor.author	Curry, Gordon B	en_UK
dc.contributor.author	Connor, Richard C H	en_UK
dc.date.accessioned	2018-09-05T13:55:22Z	-
dc.date.available	2018-09-05T13:55:22Z	-
dc.date.issued	2008-02-01	en_UK
dc.identifier.uri	http://hdl.handle.net/1893/27711	-
dc.description.abstract	Many valuable earth science data are not available in a digital format. Manual entry of such information into databases is time-consuming, unrewarding, and prone to introducing errors. Taxonomic descriptions of fossils are a good example of valuable data that are overwhelmingly available only in printed volumes and journals, some of which are increasingly rare and inaccessible. The highly structured nature of taxonomic procedures and nomenclature means that many previously published data remain equally valid to the present day, and contain information that is currently not available on the World Wide Web; these data would be of great use to a wide variety of scientists and other end users in government, industry, academia and the general public. This paper describes an XML (extensible markup language) parsing technique that allows taxonomic descriptions to be fully digitized much more rapidly than would be possible by manual entry of the data into a database. The technique exploits the high degree of structure in taxonomic descriptions, which are written in a standardized format, to automate the process of tagging separate sections of the text. Once tagged using XML, the data can be subjected to complex searches using queries written in any of the XML query standards. The XML-tagged data can potentially be imported into existing databases, in effect removing the necessity to manually enter the information, and hence overcoming the main bottleneck in generating digital data from printed material. Individual parsers can be tailored precisely to the nature of the text being analyzed, and once the underlying concepts and procedures are understood, those interested in acquiring and using digital data will be able to generate XML parsers dedicated to text with different styles of standardized formatting.	en_UK
dc.language.iso	en	en_UK
dc.publisher	Geological Society of America	en_UK
dc.relation	Curry GB & Connor RCH (2008) Automated extraction of data from text using an XML parser: An earth science example using fossil descriptions. Geosphere, 4 (1), pp. 159-169. https://doi.org/10.1130/GES00140.1	en_UK
dc.rights	The publisher does not allow this work to be made publicly available in this Repository. Please use the Request a Copy feature at the foot of the Repository record to request a copy directly from the author. You can only request a copy if you wish to use this work for your own research or private study.	en_UK
dc.rights.uri	http://www.rioxx.net/licenses/under-embargo-all-rights-reserved	en_UK
dc.subject	Geoinformatics	en_UK
dc.subject	data acquisition	en_UK
dc.subject	XML	en_UK
dc.subject	parsing	en_UK
dc.subject	taxonomy	en_UK
dc.subject	databases	en_UK
dc.title	Automated extraction of data from text using an XML parser: An earth science example using fossil descriptions	en_UK
dc.type	Journal Article	en_UK
dc.rights.embargodate	2999-12-31	en_UK
dc.rights.embargoreason	[Curry Connor 2008.pdf] The publisher does not allow this work to be made publicly available in this Repository therefore there is an embargo on the full text of the work.	en_UK
dc.identifier.doi	10.1130/GES00140.1	en_UK
dc.citation.jtitle	Geosphere	en_UK
dc.citation.issn	1553-040X	en_UK
dc.citation.volume	4	en_UK
dc.citation.issue	1	en_UK
dc.citation.spage	159	en_UK
dc.citation.epage	169	en_UK
dc.citation.publicationstatus	Published	en_UK
dc.citation.peerreviewed	Refereed	en_UK
dc.type.status	VoR - Version of Record	en_UK
dc.contributor.funder	Engineering and Physical Sciences Research Council	en_UK
dc.contributor.funder	Biotechnology and Biological Sciences Research Council	en_UK
dc.author.email	richard.connor@stir.ac.uk	en_UK
dc.citation.date	01/02/2008	en_UK
dc.contributor.affiliation	University of Glasgow	en_UK
dc.contributor.affiliation	University of Strathclyde	en_UK
dc.identifier.isi	WOS:10.1130/GES00140.1	en_UK
dc.identifier.scopusid	2-s2.0-41949097911	en_UK
dc.identifier.wtid	956111	en_UK
dc.contributor.orcid	0000-0003-4734-8103	en_UK
dc.date.accepted	2007-09-11	en_UK
dcterms.dateAccepted	2007-09-11	en_UK
dc.date.filedepositdate	2018-08-16	en_UK
rioxxterms.apc	not required	en_UK
rioxxterms.type	Journal Article/Review	en_UK
rioxxterms.version	VoR	en_UK
local.rioxx.author	Curry, Gordon B|	en_UK
local.rioxx.author	Connor, Richard C H|0000-0003-4734-8103	en_UK
local.rioxx.project	Project ID unknown|Biotechnology and Biological Sciences Research Council|http://dx.doi.org/10.13039/501100000268	en_UK
local.rioxx.project	Project ID unknown|Engineering and Physical Sciences Research Council|http://dx.doi.org/10.13039/501100000266	en_UK
local.rioxx.freetoreaddate	2258-01-02	en_UK
local.rioxx.licence	http://www.rioxx.net/licenses/under-embargo-all-rights-reserved||	en_UK
local.rioxx.filename	Curry Connor 2008.pdf	en_UK
local.rioxx.filecount	1	en_UK
local.rioxx.source	1553-040X	en_UK
Appears in Collections:	Computing Science and Mathematics Journal Articles
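
The tagging step outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' parser: the sample description, the "Label: value" layout, the element names, and the function tag_description are all invented for illustration, and the sketch assumes each section of a description starts on its own line with a fixed label. It shows how a standardized layout lets regular expressions split text into XML elements that can then be queried.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample text; the taxon, labels, and layout are assumptions,
# not the format used in the paper.
SAMPLE = """\
Genus Examplia Smith, 1900
Type species: Examplia prima Smith, 1900
Diagnosis: Small, finely ribbed shell with a short beak.
Occurrence: Jurassic; Europe.
"""

# Header line: rank keyword, taxon name, author, four-digit year.
HEADER = re.compile(r"^Genus\s+(?P<name>\S+)\s+(?P<author>.+),\s*(?P<year>\d{4})$")
# Section line: a capitalised label, a colon, then free text.
FIELD = re.compile(r"^(?P<label>[A-Z][A-Za-z ]*):\s*(?P<value>.+)$")


def tag_description(text: str) -> ET.Element:
    """Wrap each labelled section of one description in an XML element."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = HEADER.match(lines[0])
    if header is None:
        raise ValueError("first line is not a recognised genus header")

    taxon = ET.Element("taxon", rank="genus")
    ET.SubElement(taxon, "name").text = header["name"]
    ET.SubElement(taxon, "author").text = header["author"]
    ET.SubElement(taxon, "year").text = header["year"]

    for line in lines[1:]:
        field = FIELD.match(line)
        if field is None:
            continue  # unlabelled lines are skipped in this sketch
        # "Type species" becomes the element name "type_species".
        tag = field["label"].lower().replace(" ", "_")
        ET.SubElement(taxon, tag).text = field["value"]
    return taxon


taxon = tag_description(SAMPLE)
xml_text = ET.tostring(taxon, encoding="unicode")
# ElementTree's limited path syntax stands in for a full XML query language.
diagnosis = taxon.findtext("diagnosis")
```

A parser for a different publication style would mainly need different HEADER and FIELD patterns, which is the sense in which the abstract describes parsers as tailored to the text being analyzed.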

Files in This Item:

File	Description	Size	Format
Curry Connor 2008.pdf	Fulltext - Published Version	1.24 MB	Adobe PDF	Under Permanent Embargo Request a copy

This item is protected by original copyright

Items in the Repository are protected by copyright, with all rights reserved, unless otherwise indicated.

The metadata of the records in the Repository are available under the CC0 public domain dedication: No Rights Reserved https://creativecommons.org/publicdomain/zero/1.0/

If you believe that any material held in STORRE infringes copyright, please contact library@stir.ac.uk providing details and we will remove the Work from public display in STORRE and investigate your claim.

STORRE: Stirling Online Research Repository