----------------------------------------------------
FILE: README.txt
DATE: 2010-11-17
----------------------------------------------------

(c) Copyright 2009-2010, J.D. Power and Associates, All rights
reserved, no re-distribution.

This is the J.D. Power and Associates mention, co-reference,
meronymy, and sentiment corpus.

Cite this corpus as:

Jason S. Kessler, Miriam Eckert, Lyndsie Clark, and Nicolas Nicolov.
The ICWSM 2010 JDPA Sentiment Corpus for the Automotive Domain.
In the 4th International AAAI Conference on Weblogs and Social Media
Data Challenge Workshop (ICWSM-DCW 2010), 2010. Washington, D.C.

@inproceedings{KesslerEtAl2010,
  author = {Jason S. Kessler and Miriam Eckert and Lyndsie Clark and Nicolas Nicolov},
  title = {The ICWSM 2010 JDPA Sentiment Corpus for the Automotive Domain},
  booktitle = {4th International AAAI Conference on Weblogs and Social Media Data Challenge Workshop (ICWSM-DCW 2010)},
  year = {2010},
  url = {http://www.cs.indiana.edu/\~{}jaskessl/icwsm10.pdf}
}

====================================================
=============== OVERVIEW ===========================
====================================================

The JDPA Corpus consists of user-generated content (blog posts)
containing opinions about automobiles and digital cameras. The posts
have been manually annotated for named, nominal, and pronominal
mentions of entities. Entities are marked with the aggregate sentiment
expressed toward them in the document. Mentions of each entity are
marked as co-referential. Mentions are assigned semantic types
consisting of the Automatic Content Extraction (ACE) mention types and
additional domain-specific types. Meronymy (part-of and feature-of)
and instance relations are also annotated.

Expressions which convey sentiment toward an entity are annotated with
the polarity of their prior and contextual sentiments as well as the
mentions they target.

The following modifiers are annotated.
These may target other modifiers or sentiment expressions:

 - negators (expressions which invert the polarity of a sentiment
   expression or modifier)
 - neutralizers (expressions that do not commit the speaker to the
   truth of the target sentiment expression or modifier)
 - committers (expressions which shift the commitment of the speaker
   toward the truth of a sentiment expression or modifier)
 - intensifiers (expressions which shift the intensity of a sentiment
   expression or modifier)

Additionally, when the opinion holder of a sentiment expression is
someone other than the author of the blog, we link the expression to
the holder. We also annotate when two entities are compared on a
particular dimension.

The data, organized into training and testing sets, consists of 515
documents (blog posts) covering 330,762 tokens which make up 19,322
sentences. In total, 87,532 mentions and 15,637 sentiment expressions
are annotated.

====================================================
=============== DIRECTORY STRUCTURE ================
====================================================

doc/JDPA-Sentiment-Corpus-Annotation-Guidelines-ver-2009-12-17.pdf
    Description of the annotations.

doc/JDPA-Sentiment-Corpus-Licence-ver-2009-12-17.doc
    The licence in MS-Word format.

doc/README.txt
    This file.

Below, DOMAIN may be "car" or "camera".

The annotation files in XML format are in:
    $DOMAIN/batch*/annotation/*.xml

The corresponding text files are in:
    $DOMAIN/batch*/txt/*.txt

Some files have accompanying metadata, which includes the URL of the
file's text:
    $DOMAIN/batch*/meta/*-meta.xml

====================================================
=============== FILE STRUCTURE =====================
====================================================

The XML files provide stand-off annotations for their corresponding
text files. The scheme, which follows, is based on the XML format used
by the Protege plug-in Knowtator (http://knowtator.sourceforge.net/).
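
As an illustration only, the following Python sketch shows one way to
list the annotation and text files for a domain, following the
directory patterns given under DIRECTORY STRUCTURE, and to recover the
text between two byte offsets of a text file, which is how stand-off
annotations in this section point into the text. The function names
corpus_files and span_text are hypothetical, and the UTF-8 encoding
and the half-open [start, end) reading of the offsets are assumptions
that should be checked against the actual files.

```python
from pathlib import Path


def corpus_files(root, domain):
    """Return (annotation_paths, text_paths) for one domain, sorted by path.

    `root` is the directory that contains the "car" and "camera" folders.
    """
    if domain not in ("car", "camera"):
        raise ValueError("domain must be 'car' or 'camera'")
    base = Path(root) / domain
    # Patterns taken from the README: $DOMAIN/batch*/annotation/*.xml
    # and $DOMAIN/batch*/txt/*.txt
    annotations = sorted(base.glob("batch*/annotation/*.xml"))
    texts = sorted(base.glob("batch*/txt/*.txt"))
    return annotations, texts


def span_text(text_path, start, end, encoding="utf-8"):
    """Return the text between two byte offsets of a corpus text file.

    Assumptions (not stated in the README): the file is UTF-8 and the
    span is half-open, i.e. it covers bytes start .. end-1.
    """
    # Read bytes, not characters, so that the offsets count bytes.
    raw = Path(text_path).read_bytes()
    return raw[start:end].decode(encoding, errors="replace")
```

Slicing the raw bytes before decoding matters whenever a text file
contains multi-byte characters, because character positions and byte
positions then differ.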
Annotations span two or more tags in the tag.

The first tag is , containing the subtag, specifying the id of the
annotation. Next is the subtag, giving an anonymized annotator's id
and pseudonym. specifies the start and end byte offsets of the
annotation and the text it spans while contains the text covered by
the annotation. is optional and may omit some leading/trailing
whitespace (or runs of whitespace). See the tag below for an example.

The second tag is , linked to the annotation tag's id by the "id"
attribute. The only required subtag is , whose content and "id"
attribute are the semantic type of the annotation.

A tag may have zero or more subtags. Each of these corresponds to a
property of the annotation, detailed in either a tag or a tag. The
*SlotMention tags are linked via the "id" attribute in .

is used for slots that have properties which are nominal, numeric or
textual. The slot's name is in the "id" attribute of the subtag while
the value of the slot is in the "value" attribute of the subtag.

Some slots are used to refer to other annotations. These "complex"
slots are specified through the tag. Like , this tag requires the
subtag, whose "id" attribute specifies the name of the slot. However,
its value is specified through the "value" attribute of subtag. The
value is always the id of the annotation the slot refers to. Some tags
have multiple subtags, each containing an annotation id.

Here is an example:

    ... Annotator 3 Nissan Mention.Organization ...

The semantic types of the annotations and their slots (in the XML
files in the annotation/ directories) are explained in the annotation
guidelines.

====================================================
=============== BATCH INFORMATION ==================
====================================================

Car section:

  Batch 001: First batch.
    Size: 78,604 tokens.
  Batch 004: Addition of Mention.CarFeature to distinguish concrete,
    removable or purchasable CarParts from more abstract CarFeatures
    such as power, acceleration and drive.
    Size: 7,643 tokens.

  Batch 005: Consists of JDPower car review files. No changes made to
    the annotation schema. Start of preannotation.
    Size: 42,019 tokens.

  Batch 006: Addition of Mention.Descriptor for adjectives preceding
    mention nouns, such as *heated*, *power* seats; MemberOf slot
    added to link individual mentions to a plural mention.
    Size: 95,864 tokens.

  Batch 007: Removal of Mention.Descriptor and addition of Descriptor
    class to reflect the fact that descriptors do not refer to
    discourse entities.
    Size: 11,221 tokens.

  Batch 008: Same format as Batch 007.
    Size: 30,612 tokens.

====================================================
=============== CONTRIBUTORS =======================
====================================================

Claire Bonial
Lyndsie Clark
Miriam Eckert
Meredith Green
George Figgs
Steliana Ivanova
Hanna Lind
Jason Kessler
Nicolas Nicolov
Ronald Woodward
Whitney Zimmer

====================================================
=============== ACKNOWLEDGEMENTS ===================
====================================================

We would like to thank Prof. Martha Palmer, Prof. James Martin, and
Prof. Michael Mozer of The University of Colorado at Boulder for
insightful discussions on the corpus and Dr. Richard Wolniewicz,
Chance Parker, and Rich Belanger of J.D. Power and Associates for
supporting the project.

====================================================
=============== CONTACT ============================
====================================================

ICWSM.JDPA.Corpus@gmail.com

====================================================
=============== NOTE ===============================
====================================================

The opinions and claims expressed in the corpus documents are those of
the documents' authors, as assessed by human annotators.
That one brand or model may receive more positive or negative
sentiment than another in the corpus is not indicative of a difference
of opinion in the blogosphere.

--- END: README.txt --------------------------------