Methods and Techniques for Segmentation of Consumers in Social Media

PhD Thesis

Óscar Muñoz García (MSc Artificial Intelligence)
Departamento de Inteligencia Artificial
ETS de Ingenieros Informáticos

Supervisors:
Asunción Gómez Pérez (PhD Computer Science, MBA)
Raúl García Castro (PhD Computer Science and Artificial Intelligence)

2015

Tribunal nombrado por el Sr. Rector Magfco. de la Universidad Politécnica de Madrid, el día de de .

Presidente:
Vocal:
Vocal:
Vocal:
Secretario:
Suplente:
Suplente:

Realizado el acto de defensa y lectura de la Tesis el día de de en la Escuela Técnica Superior de Ingenieros Informáticos.

Calificación:

EL PRESIDENTE    LOS VOCALES    EL SECRETARIO

A Mari. Gracias por tu comprensión durante todo el tiempo que he dedicado a la tesis.
A mis padres. Gracias por todo vuestro apoyo y motivación, sin los cuales no habría llegado hasta aquí.
A mi hija Lucía.

Acknowledgements

This thesis represents the final stage of a long period of my life that I would never have completed without the help of many people, whose inestimable support is worth its weight in gold. First of all, I want to acknowledge all the co-authors of the research works that have contributed to the contents included in this thesis: Silvia, Inés, Nuria, Marc, Beatriz, Gloria, Javier, Daniel, Jesús, David, Guadalupe, Auxi, Socorro, Elena, Víctor, and Carlos. This thesis would not have been possible without their hard work.

Havas Media Group deserves special recognition. I want to acknowledge my colleagues there for all their lessons about marketing and advertising. I could not imagine my professional career from now on without their support and training. Especially, I have no words to express my gratitude to Gloria.

I also want to acknowledge the Spanish Centre for the Development of Industrial Technology, which has partially supported this research under the CENIT programme in the context of the Social Media Project (CEN-20101037). Thanks a lot to all the partners in this project.

Finally, I want to acknowledge my supervisors, Asun and Raúl, for their guidance, reviews and patience, during and before the writing of this thesis. I hope I have lived up to their expectations.

Abstract

Social media has revolutionised the way in which consumers relate to each other and to brands. The opinions published in social media have a power to influence purchase decisions as great as that of advertising campaigns. Consequently, marketers are increasing their efforts and investments to obtain indicators that measure brand health from the digital content generated by consumers. Given the unstructured nature of social media contents, the technology used for processing such contents often implements Artificial Intelligence techniques, such as natural language processing, machine learning and semantic analysis algorithms.

This thesis contributes to the State of the Art with a model for structuring and integrating the information posted on social media, and a number of techniques whose objectives are the identification of consumers, as well as their socio-demographic and psychographic segmentation. The consumer identification technique is based on the fingerprint of the devices they use to surf the Web and is tolerant to the changes that occur frequently in such fingerprints. The psychographic profiling techniques described infer the position of consumers in the purchase funnel and allow opinions to be classified according to a series of marketing attributes.
Finally, the socio-demographic profiling techniques allow the place of residence and gender of consumers to be obtained.

Resumen

Los medios sociales han revolucionado la manera en la que los consumidores se relacionan entre sí y con las marcas. Las opiniones publicadas en dichos medios tienen un poder de influencia en las decisiones de compra tan importante como las campañas de publicidad. En consecuencia, los profesionales del marketing cada vez dedican mayores esfuerzos e inversión a la obtención de indicadores que permitan medir el estado de salud de las marcas a partir de los contenidos digitales generados por sus consumidores. Dada la naturaleza no estructurada de los contenidos publicados en los medios sociales, la tecnología usada para procesar dichos contenidos a menudo implementa técnicas de Inteligencia Artificial, tales como algoritmos de procesamiento de lenguaje natural, aprendizaje automático y análisis semántico.

Esta tesis contribuye al estado de la cuestión con un modelo que permite estructurar e integrar la información publicada en medios sociales, y una serie de técnicas cuyos objetivos son la identificación de consumidores, así como la segmentación psicográfica y sociodemográfica de los mismos. La técnica de identificación de consumidores se basa en la huella digital de los dispositivos que utilizan para navegar por la Web y es tolerante a los cambios que se producen con frecuencia en dicha huella digital. Las técnicas de segmentación psicográfica descritas obtienen la posición en el embudo de compra de los consumidores y permiten clasificar las opiniones en función de una serie de atributos de marketing. Finalmente, las técnicas de segmentación sociodemográfica permiten obtener el lugar de residencia y el género de los consumidores.

Contents

1 INTRODUCTION
  1.1 Thesis Structure
  1.2 Dissemination of Results
2 STATE OF THE ART
  2.1 Semantic Vocabularies for Representing Social Media Information
    2.1.1 Conclusions
  2.2 Techniques for Tracking Users in the Web
    2.2.1 Techniques for Capturing Web Activity
      2.2.1.1 Technique Based on Web Logs
      2.2.1.2 Technique Based on Web Beacons
      2.2.1.3 Technique Based on JavaScript Tags
      2.2.1.4 Technique Based on Packet Sniffing
    2.2.2 Techniques for Identifying Unique Users
      2.2.2.1 Technique Based on Cookies
      2.2.2.2 Technique Based on Fingerprint
    2.2.3 Conclusions
  2.3 Technique for Detecting the Evolution of Temporary Records
    2.3.1 Early Binding Algorithm [Li et al., 2011]
    2.3.2 Late Binding Algorithm [Li et al., 2011]
    2.3.3 Adjusted Binding Algorithm [Li et al., 2011]
    2.3.4 Conclusions
  2.4 Social Media Analysis Applied to Market Research
    2.4.1 KPIs Based on Social Media Analysis
    2.4.2 Conclusions
  2.5 Marketing Background
    2.5.1 The Consumer Decision Journey
    2.5.2 The Marketing Mix
    2.5.3 Research on Human Emotions
    2.5.4 Owned, Paid and Earned Media
    2.5.5 Marketing Technology
    2.5.6 Conclusions
  2.6 Analysis of Social Media Content
    2.6.1 Lemmatisation and Part-Of-Speech Tagging
    2.6.2 Normalisation of Microposts
    2.6.3 Sentiment Analysis
    2.6.4 Identification of Wishes
    2.6.5 Detection of Place of Residence
    2.6.6 Detection of Gender
    2.6.7 Conclusions
  2.7 Open Research Problems
3 APPROACH
  3.1 Objectives
  3.2 Contributions to the State of the Art
  3.3 Assumptions
  3.4 Hypotheses
  3.5 Restrictions
4 RESEARCH METHODOLOGY
  4.1 Terminology
  4.2 Research Methodology
  4.3 Method Followed for Obtaining the Artefacts Provided by this Thesis
    4.3.1 Method Followed for Ontology Engineering
    4.3.2 Method Followed for the Data Mining Techniques
      4.3.2.1 Business Understanding
      4.3.2.2 Data Understanding
      4.3.2.3 Data Preparation
      4.3.2.4 Modelling
      4.3.2.5 Evaluation
      4.3.2.6 Deployment
5 SOCIAL MEDIA ONTOLOGY FOR CONSUMER ANALYTICS
  5.1 Ontology Modules
  5.2 Notation Used
  5.3 Core Ontology Module
  5.4 Publication Channels Module
  5.5 Contents Module
  5.6 Users Module
  5.7 Opinions Module
  5.8 Topics and Keywords Module
  5.9 Geographical Locations Module
6 MORPHOSYNTACTIC CHARACTERISATION OF SOCIAL MEDIA CONTENTS
  6.1 Types of Social Media Analysed
  6.2 Distribution of Part-of-Speech Categories
    6.2.1 Distribution of Nouns
    6.2.2 Distribution of Adjectives
    6.2.3 Distribution of Adverbs
    6.2.4 Distribution of Determiners
    6.2.5 Distribution of Conjunctions
    6.2.6 Distribution of Pronouns
    6.2.7 Distribution of Prepositions
    6.2.8 Distribution of Punctuation Marks
    6.2.9 Distribution of Verbs
  6.3 Hypothesis Validation
7 TECHNIQUE FOR UNIQUE USER IDENTIFICATION BASED ON EVOLVING DEVICE FINGERPRINT DETECTION
  7.1 Data Understanding Activity
    7.1.1 Collect Initial Data Task
    7.1.2 Describe Data Task
    7.1.3 Explore Data Task
    7.1.4 Verify Data Quality Task
  7.2 Data Preparation Activity
    7.2.1 Select Data Task
    7.2.2 Clean Data Task
    7.2.3 Construct Data Task
  7.3 Modelling Activity
    7.3.1 Select Modelling Technique Task
      7.3.1.1 Cluster Signature
      7.3.1.2 Similarity Computation
      7.3.1.3 Attribute Weight Computation
    7.3.2 Generate Test Design Task
    7.3.3 Build Model Task
      7.3.3.1 X-Real-IP Header
      7.3.3.2 X-Forwarded-For Header
      7.3.3.3 User-Agent Header
      7.3.3.4 Accept Header
      7.3.3.5 Accept-Language Header
      7.3.3.6 Accept-Charset Header
      7.3.3.7 Accept-Encoding Header
      7.3.3.8 Cache-Control Header
      7.3.3.9 Plugins
      7.3.3.10 Fonts
      7.3.3.11 Video
      7.3.3.12 Time zone
      7.3.3.13 Session Storage
      7.3.3.14 Local Storage
      7.3.3.15 Internet Explorer Persistence
  7.4 Evaluation
    7.4.1 Evaluation Metrics
      7.4.1.1 Rand Index
      7.4.1.2 Error Rate
      7.4.1.3 Recall
      7.4.1.4 Specificity
      7.4.1.5 False Positive Rate
      7.4.1.6 False Negative Rate
      7.4.1.7 Precision
      7.4.1.8 F-measure
      7.4.1.9 Purity
    7.4.2 Evaluation Results
      7.4.2.1 Variant Based on Uniform Weights
      7.4.2.2 Variant Based on Attribute Entropy
      7.4.2.3 Variant Based on Time Decay
      7.4.2.4 Variant Based on Attribute Entropy and Time Decay
      7.4.2.5 Comparison of the Variants
  7.5 Hypothesis Validation
8 TECHNIQUES FOR SEGMENTATION OF CONSUMERS FROM SOCIAL MEDIA CONTENT
  8.1 Common Elements Used by the Techniques
    8.1.1 Collect Initial Data Task
    8.1.2 Data Preparation Activity
      8.1.2.1 Select Data Task
      8.1.2.2 Clean Data Task
      8.1.2.3 Construct Data Task
    8.1.3 Rule-based Modelling Technique
  8.2 Technique for Detecting Consumer Decision Journey Stages
    8.2.1 Data Understanding Activity
      8.2.1.1 Collect Initial Data Task
      8.2.1.2 Describe Data Task
      8.2.1.3 Explore Data Task
      8.2.1.4 Verify Data Quality Task
    8.2.2 Modelling Activity
      8.2.2.1 Select Modelling Technique Task
      8.2.2.2 Build Model Task
  8.3 Technique for Detecting Marketing Mix Attributes
    8.3.1 Data Understanding Activity
      8.3.1.1 Collect Initial Data Task
      8.3.1.2 Describe Data Task
      8.3.1.3 Explore Data Task
      8.3.1.4 Verify Data Quality Task
    8.3.2 Modelling Activity
      8.3.2.1 Select Modelling Technique Task
      8.3.2.2 Build Model Task
  8.4 Technique for Detecting Emotions
    8.4.1 Data Understanding Activity
      8.4.1.1 Collect Initial Data Task
      8.4.1.2 Describe Data Task
      8.4.1.3 Explore Data Task
      8.4.1.4 Verify Data Quality Task
    8.4.2 Modelling Activity
      8.4.2.1 Select Modelling Technique Task
      8.4.2.2 Generate Test Design Task
      8.4.2.3 Build Model Task
  8.5 Technique for Detecting Place of Residence
    8.5.1 Data Understanding Activity
      8.5.1.1 Collect Initial Data Task
      8.5.1.2 Describe Data Task
      8.5.1.3 Explore Data Task
    8.5.2 Data Preparation Activity
    8.5.3 Modelling Activity
      8.5.3.1 Select Modelling Technique Task
      8.5.3.2 Generate Test Design Task
  8.6 Technique for Detecting Gender
    8.6.1 Data Understanding Activity
      8.6.1.1 Collect Initial Data Task
      8.6.1.2 Describe Data Task
      8.6.1.3 Explore Data Task
    8.6.2 Data Preparation Activity
    8.6.3 Modelling Activity
      8.6.3.1 Select Modelling Technique Task
      8.6.3.2 Generate Test Design Task
  8.7 Evaluation
    8.7.1 Evaluation Metrics
      8.7.1.1 Accuracy
      8.7.1.2 Recall
      8.7.1.3 Precision
      8.7.1.4 F-measure
    8.7.2 Evaluation Results
      8.7.2.1 Technique for Detecting Consumer Decision Journey Stages
      8.7.2.2 Technique for Detecting Marketing Mix Attributes
      8.7.2.3 Technique for Detecting Emotions
      8.7.2.4 Technique for Detecting Place of Residence
      8.7.2.5 Technique for Detecting Gender
  8.8 Validation of Hypotheses
9 CONCLUSIONS AND FUTURE WORK
  9.1 Social Media Data Model for Consumer Analytics
  9.2 Morphosyntactic Characterisation of Social Media Contents
  9.3 Technique for Unique User Identification Based on Evolving Device Fingerprint
  9.4 Techniques for Segmentation of Consumers from Social Media Content
    9.4.1 Technique for Detecting Consumer Decision Journey Stages
    9.4.2 Technique for Detecting Marketing Mix Attributes
    9.4.3 Technique for Detecting Emotions
    9.4.4 Technique for Identifying the Place of Residence of Social Media Users
    9.4.5 Technique for Identifying the Gender of Social Media Users
    9.4.6 Normalisation of User-Generated Content
    9.4.7 Evaluation of Scalability

List of Figures

2.1 Process followed by the technique based on web logs (adapted from Kaushik [2007])
2.2 Process followed by the technique based on web beacons (adapted from Kaushik [2007])
2.3 Process followed by the technique based on JavaScript tags (adapted from Kaushik [2007])
2.4 Process followed by the tags or web beacons techniques for gathering data from multiple sites (adapted from Kaushik [2007])
2.5 Process followed by the technique based on packet sniffing (adapted from Kaushik [2007])
2.6 Consumer Decision Journey stages adopted in this thesis
3.1 Contributions to the State of the Art
3.2 Relationships between the objectives, contributions, assumptions, hypotheses and restrictions
4.1 Relations between methodology, methods, techniques, processes, activities and tasks (adapted from Gómez-Pérez et al. [2004])
4.2 Iterative research methodology using exploratory and experimental approaches
4.3 Web mining framework (adapted from Hu and Cercone [2004])
4.4 The CRISP-DM reference model (adapted from Shearer [2000])
5.1 Ontology network
5.2 Social Graph Ontology modules
5.3 Class Example
5.4 Object Property Example
5.5 Inverse Object Properties Example
5.6 Class Inheritance Example
5.7 Property Inheritance Example
5.8 Instances Example
5.9 Core ontology module of the SGO
5.10 Publication Channels module of the SGO
5.11 Contents module of the SGO
5.12 Users module of the SGO
5.13 Opinions module of the SGO
5.14 Topics and Keywords module of the SGO
5.15 Locations module of the SGO
7.1 Format of the data used by the technique for unique user identification based on evolving device fingerprint detection
7.2 Daily distribution of visitors during the period of study
7.3 Daily distribution of visits during the period of study
7.4 Daily distribution of page views during the period of study
7.5 Distribution of the activity records captured by unique user
7.6 Distribution of visits per country
7.7 Disagreement decay for the X-Real-IP header (second interval)
7.8 Disagreement decay for the X-Real-IP header (first interval)
7.9 Agreement decay for the X-Real-IP header
7.10 Agreement decay for the X-Forwarded-For header
7.11 Disagreement decay for the User-Agent header
7.12 Agreement decay for the User-Agent header
7.13 Disagreement decay for the Accept header
7.14 Agreement decay for the Accept header
7.15 Disagreement decay for the Accept-Language header
7.16 Agreement decay for the Accept-Language header
7.17 Disagreement decay for the Accept-Charset header
7.18 Agreement decay for the Accept-Charset header
7.19 Disagreement decay for the Accept-Encoding header
7.20 Agreement decay for the Accept-Encoding header
7.21 Disagreement decay for the Cache-Control header
7.22 Agreement decay for the Cache-Control header
7.23 Disagreement decay for the Plugins attribute
7.24 Agreement decay for the Plugins attribute
7.25 Disagreement decay for the Fonts attribute (second interval)
7.26 Disagreement decay for the Fonts attribute (first interval)
7.27 Agreement decay for the Fonts attribute
7.28 Disagreement decay for the Video attribute
7.29 Agreement decay for the Video attribute
7.30 Disagreement decay for the Time zone attribute
7.31 Agreement decay for the Time zone attribute
7.32 Disagreement decay for the Session Storage attribute
7.33 Agreement decay for the Session Storage attribute
7.34 Disagreement decay for the Local Storage attribute
7.35 Agreement decay for the Local Storage attribute
7.36 Disagreement decay for the Internet Explorer persistence attribute
7.37 Agreement decay for the Internet Explorer persistence attribute
7.38 Performance of the variants evaluated for the technique for unique user identification based on evolving device fingerprint detection
8.1 Initial Data Collection task executed by the content-analysis techniques
8.2 Data Preparation Activity implemented by the content-analysis techniques
8.3 Clean data task executed by the content-analysis techniques
8.4 Construct data task executed by the content-analysis techniques
8.5 Format of the data used by the technique for detecting Consumer Decision Journey stages
8.6 Distribution of the texts along the media sources and sectors for the Consumer Decision Journey gold standard
8.7 Distribution of the texts along the Consumer Decision Journey categories
8.8 Example annotation of a post according to a Consumer Decision Journey category using Amazon Mechanical Turk
8.9 Format of the data used by the technique for detecting Marketing Mix attributes
8.10 Example annotation of a post according to a Marketing Mix category using Amazon Mechanical Turk
8.11 Format of the data used by the technique for detecting emotions
8.12 Example annotation of a post according to an Emotions category using Amazon Mechanical Turk
8.13 Data format of the corpus used by the technique for detecting the place of residence of social media users
8.14 Example of user profile location metadata
8.15 Example of an output of the Google Geocoding API
8.16 Example execution of table location filtering process
8.17 Example of user profile description metadata
8.18 Example of location extraction from content
8.19 Data format of the corpus used by the technique for detecting the gender of social media users
8.20 Example of user profile name metadata
8.21 Dependency tree obtained from a tweet that mentions a user
8.22 Accuracy of the Consumer Decision Journey classifier for English
8.23 Accuracy of the Consumer Decision Journey classifier for Spanish
8.24 Accuracy of the Consumer Decision Journey classifier by sector
8.25 Accuracy of the Marketing Mix classifier for English
8.26 Accuracy of the Marketing Mix classifier for Spanish
8.27 Accuracy of the emotions classifier
8.28 Accuracy of the emotions classifier by sector
8.29 Accuracy of the emotions classifier by social media type
8.30 Performance of the gender recognition approaches

List of Tables

2.1 Prefixes that can be declared in a web server log file
2.2 Identifiers that can be declared in a web server log file
2.3 Subcategories of the Marketing Mix elements
2.4 Categories for the sentiment classification, organised according to their polarity
2.5 Relations between the conceptual framework of emotions used in this thesis and the Wordnet-Affect taxonomy
2.6 Example lemmatisation and part-of-speech tagging of an example text
5.1 Vocabularies selected for defining the Social Graph Ontology
5.2 Properties of the class sioc:UserAccount
5.3 Properties of the class sioc:Post (1/2)
5.4 Properties of the class sioc:Post (2/2)
5.5 Properties of the class sioc:Forum
5.6 Properties of the class marl:Opinion
5.7 Properties of the class skos:Concept
5.8 Properties of the class sioc:Community
5.9 Properties of the class rdfg:Graph
5.10 Properties of the class sioc:Site
5.11 Properties of the class foaf:Document
5.12 Properties of the class schema:Review
5.13 Property of the class sioc:Role
5.14 Properties of the class foaf:Agent
5.15 Properties of the class foaf:Person
5.16 Properties of the class foaf:Activity
5.17 Properties of the class sgo:Cookie
5.18 Properties of the class sgo:Fingerprint
5.19 Properties of the class tzont:PoliticalRegion
5.20 Properties of the class tzont:Country
5.21 Properties of the class tzont:State
5.22 Properties of the class tzont:County
5.23 Properties of the class tzont:City
5.24 Properties of the class schema:Continent
5.25 Properties of the class tzont:TimeZone
6.1 Distribution of part-of-speech categories by social media type
7.1 Statistics associated to the number of records gathered per unique user
7.2 Distribution of visits for the 10 countries that generated more site activity
7.3 Entropy of fingerprint attributes
7.4 Cross-entropy between pairs of fingerprint attributes
7.5 Conditional entropy between pairs of fingerprint attributes
7.6 User-Agent values for Google, Bing, and Yahoo! robots
7.7 Disagreement decay of fingerprint attributes
7.8 Agreement decay of fingerprint attributes
7.9 Evaluation results for the variant based on uniform weights
7.10 Evaluation results for the variant based on attribute entropy
7.11 Evaluation results for the variant based on time decay
7.12 Evaluation results for the variant based on attribute entropy and time decay
7.13 Comparison of the variants with the best performance
8.1 Examples of the linguistic patterns for identifying Consumer Decision Journey stages
8.2 Primary and secondary sentiments
8.3 Distribution of texts for the sentiment corpus by social media type
8.4 Distribution of texts for the sentiment corpus by domain
8.5 Distribution of texts for the sentiment corpus for the training and test sets by domain
8.6 Excerpt from sentiments in Badele3000
8.7 Examples of rules for classifying emotions
8.8 Collocations of "odio" in Badele3000
8.9 Accuracy of the place of residence identification approaches
8.10 Coverage of the gender recognition approaches
8.11 Confusion matrix with the results of the approach based on mentions to users
9.1 Rule reordering example

Chapter 1

INTRODUCTION

The rise of Web 2.0 technologies and social media has enabled users to author their own content. This has populated the Web with huge amounts of user-generated content that can be exploited for many different and interesting purposes, such as explaining or predicting real-world outcomes through opinion mining, which provides a valuable tool for market research. Data scientists in almost every industry that is exposed to public opinion are under pressure to deal with the explosive growth of social media. Such professionals must be aware of what is said about the issues that affect their business in different social media channels.

Social media are media in which information is created by the interaction of users, who express their opinions freely and spontaneously. This has revolutionised the way in which organisations and consumers interact. Users have massively adopted these channels to engage in conversations about content, products, and brands, while organisations are striving to adapt proactively to the threats and opportunities that this new dynamic environment poses. Social media is a knowledge mine about users, communities, preferences and opinions, which has the potential to positively impact marketing and product development activities [Weber, 2007].

In the marketing field, the digitalisation of media and society has revolutionised the rules of traditional brand communication, with an explosion of channels and possibilities for brands to contact consumers. Brands and media agencies face a major challenge in developing systems that assure the best communication strategy for the brand (in terms of cost, effectiveness and efficiency).
Activities such as word-of-mouth advertising, where products or brands are promoted via oral or written communication, have successfully adapted to social media through viral processes. It is becoming essential to know the views of consumers towards brands and products when designing advertising campaigns, estimating future sales and deciding the strategy to follow when launching a new brand image. According to a Nielsen [2012b] report, 70% of social media users take into account the product experiences published by other users; 65% declare that they search for information about brands, products and services; 53% express positive comments on brands; and 50% express complaints at least once per month.

Social media monitoring tools are being used successfully in a range of domains (including market research, online publishing, etc.). However, the tools available nowadays to analyse social media do not completely leverage the rich and complex information structure generated by users. Most of these tools elaborate their reports from metrics based on the volume of posts, the opinion polarity about the subject being studied, and users' reputation. Although such metrics are good indicators of a subject's popularity and relevance, they are often inadequate for capturing complex, multi-modal dimensions of the subjects to be measured that are relevant to business, and must be complemented with ad-hoc studies such as opinion polls. Therefore, existing opinion-mining techniques must be extended to discover other aspects of discourse, such as consumer intents, mood and emotions. Overcoming some of the limitations of current tools to manage and analyse the information produced in social media is a pending challenge that this thesis addresses.

The main goal of this thesis is to provide a data model and a set of techniques based on Web user tracking and natural language processing for extracting semantic information from the contents generated by consumers in social media. In the following paragraphs we introduce the specific contributions of this thesis to the State of the Art.

The disparity of formats, mechanisms for accessing the information, content sizes, and metadata hinders the collection, integration and processing of the content published in social media, forcing the use of specific methods and techniques for each kind of media. In this thesis, we provide a data model for the marketing domain that can be used for standardising and normalising the information that can be extracted from social media about consumers, brands, media and the opinions of consumers about brands (C1).

The distributed nature of the Web and the disparity of devices that can be used to access social media (PCs, smartphones, tablets, smart TVs, etc.) make it difficult to track the actions performed by users for web analytics purposes. Unique user identification is a key task within the web analytics data collection process, and is useful for measuring the effectiveness of online advertising campaigns, among other applications. The fingerprinting technique consists in tracking user activity on a set of sites by capturing technical information about the browser and the machine that the user employs to navigate the Web. Browser fingerprinting has been demonstrated to be an effective method for unique user identification when the device used to navigate the Web does not support cookies. However, as the attributes used for generating the browser fingerprint evolve, multiple distinct fingerprint records are created for the same user, leading to incorrect unique user identification. This thesis contributes to the State of the Art with a technique for unique user identification that detects browser fingerprint evolution (C3).
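To illustrate the idea behind fingerprint-based identification, the following minimal sketch compares two fingerprint records as a weighted agreement over their attributes; a high score suggests that both records belong to the same user even though one attribute has changed. The attribute names, weights and values are illustrative assumptions, not the exact model built in Chapter 7.

```python
# Illustrative sketch only (not the algorithm of Chapter 7): weighted agreement
# between two fingerprint records captured at different moments.
WEIGHTS = {"user_agent": 0.3, "accept_language": 0.1, "plugins": 0.25,
           "fonts": 0.25, "time_zone": 0.1}

def fingerprint_similarity(fp_a, fp_b, weights=WEIGHTS):
    """Return a score in [0, 1]: the weighted fraction of attributes that agree."""
    agreeing = sum(w for attr, w in weights.items()
                   if fp_a.get(attr) is not None and fp_a.get(attr) == fp_b.get(attr))
    return agreeing / sum(weights.values())

# Two records for the same device; the browser was upgraded between captures.
old = {"user_agent": "Firefox/3.6", "accept_language": "es-ES",
       "plugins": "flash,java", "fonts": "arial,verdana", "time_zone": "-60"}
new = dict(old, user_agent="Firefox/4.0")

# A score close to 1 suggests both records belong to one evolving fingerprint.
print(fingerprint_similarity(old, new))   # 0.7
```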
In the last decade, the availability of digital user-generated documents from social media has dramatically increased. This massive growth of user-generated content has also affected traditional shopping behaviour. Customers have embraced new communication channels such as microblogs and social networks that enable them not just to talk with friends and acquaintances about their shopping experience, but also to search for opinions expressed by complete strangers as part of their decision-making processes. Uncovering how customers feel about specific products or brands and detecting purchase habits and preferences has traditionally been a costly and highly time-consuming task which involved the use of methods such as focus groups and surveys. However, the new scenario calls for a deep assessment of current market research techniques in order to better interpret and profit from this ever-growing stream of attitudinal data. With this purpose, we present a novel analysis and classification of user-generated content in terms of its belonging to one of the four stages of the Consumer Decision Journey [Court et al., 2009] (i.e. the purchase process from the moment when a customer is aware of the existence of the product to the moment when he or she buys, experiences and talks about it) (C4.1). Using a corpus of short texts written in English and Spanish and extracted from different social media, this thesis identifies a set of linguistic patterns for each purchase stage that are then used in a rule-based classifier. Additionally, we use machine-learning algorithms to automatically identify business indicators such as the Marketing Mix elements [McCarthy and Brogowicz, 1981] (C4.2).

Sentiment analysis of social media is of commercial interest, as user-generated content published on the Web reaches and influences many potential customers. Most work in this field has focused on opinion polarity (positive or negative) and, therefore, does not specify the kind of sentiment related to that opinion. In order to provide this information, this thesis establishes four polarised categories that capture the main sentiments that can be found on social media: satisfaction-dissatisfaction (SD), trust-fear (TF), love-hate (LH), and happiness-sadness (HS). It develops a rule-based system that classifies texts in Spanish from those social media according to this sentiment classification with respect to a brand, company or product. The rules have been written in a simple grammar after (linguistically) analysing a corpus of different business domains whose texts had been manually classified (C4.3).

Characterising users through demographic attributes is a necessary step before conducting opinion surveys from the information published by such users in social media. In this thesis, we describe, compare and evaluate different techniques for the identification of the attributes "gender" (C4.4) and "place of residence" (C4.5) by mining the metadata associated with the users, the content published and shared by them, and their friendship networks.
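As a hedged illustration of what mining profile metadata can look like, the sketch below guesses gender from the first token of a profile name using a small name lexicon. The lexicon is a toy placeholder and the heuristic is deliberately naive; the techniques actually evaluated in Chapter 8 combine richer sources of evidence.

```python
# Toy illustration of metadata-based gender detection; not the thesis's exact method.
NAME_LEXICON = {"maría": "female", "lucía": "female", "óscar": "male", "javier": "male"}

def gender_from_profile_name(profile_name):
    """Return 'male', 'female' or None when the first name is not in the lexicon."""
    tokens = profile_name.strip().split()
    first = tokens[0].lower() if tokens else ""
    return NAME_LEXICON.get(first)

print(gender_from_profile_name("Lucía Fernández"))  # female
print(gender_from_profile_name("Alex B."))           # None: ambiguous or unknown name
```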
Natural language processing techniques are a key technology for analysing user-generated content. Although some efforts have been made to structure social media information, such as TwitLogic [Shinavier, 2010], there is still the need for approaches that are able to cope with the different channels in the Social Web and with the challenges they pose. The content published in social media is characterised by the use of casual language; social media posts contain texts that vary in length from short sentences in microblogs to medium-size articles in web logs. Very often the text published in social media contains misspellings, is completely written in uppercase or lowercase letters, or is composed of set phrases, among other characteristics that challenge existing content analysis techniques, leading to problems regarding the accuracy of natural language processing tools like part-of-speech taggers. As an example, for the Spanish language, the absence of an accent in a word may give such a word a completely different meaning. As a minor contribution, this thesis studies the differences in the language used in heterogeneous social media sources by analysing the distribution of the part-of-speech categories extracted from the morphological analysis of a sample of texts published in such sources, showing that normalising user-generated content is a necessary step before analysing social media posts, particularly on Twitter (http://twitter.com) (C2). Therefore, the content analysis techniques proposed by this thesis implement a stage that performs a morphological normalisation of user-generated content, making use of on-line and collectively developed resources, including Wikipedia (http://www.wikipedia.org) and an SMS lexicon. The results obtained demonstrate that the normalisation of user-generated content slightly improves the accuracy of the content analysis techniques presented in this thesis.

1.1 Thesis Structure

This thesis is structured as follows:

• Chapter 2 reviews the State of the Art and identifies the open research problems addressed in this thesis.

• Chapter 3 presents the objectives of this thesis, which were defined according to the open research problems identified in Chapter 2. In addition, we present the contributions to the State of the Art, as well as the assumptions and hypotheses on which our contributions rely. Finally, we describe the restrictions, which define the scope of the different contributions.

• Chapter 4 presents the research methodology and the method followed for obtaining the artefacts provided by this thesis, which is inspired by an existing framework for web mining. For defining the model of the data warehouse, we have followed an existing methodology for building ontology networks. For addressing the rest of the phases defined by the framework, we have followed an existing data mining process model.

• Chapter 5 describes the data model that we have designed for representing the information extracted from social media for the marketing domain.

• Chapter 6 characterises the different kinds of social media according to the morphosyntactic characteristics of the textual content published in such media.

• Chapter 7 provides a technique for uniquely identifying users in social media based on the fingerprint of their devices, regardless of the evolution of such fingerprints. The chapter also presents the evaluation results and describes the data set used for evaluating the technique.
• Chapter 8 presents a collection of techniques for extracting socio-demographic and psychographic profiles of social media users, applied to the marketing domain, through the analysis of the opinions they express about brands, as well as of the profiles published by them in social networks. The chapter also presents the evaluation results and describes the data sets used for evaluating the techniques.

• Finally, Chapter 9 presents research conclusions and possible future lines of research and innovation.

1.2 Dissemination of Results

Some of the contributions produced within the framework of this thesis have been published in international peer-reviewed journals, conferences and workshops. In the following we list the contributions along with the publications that support them.

The technique proposed for uniquely identifying users in social media based on the fingerprint of their devices has been published in an international journal:

Óscar Muñoz-García, Javier Monterrubio-Martín, Daniel García-Aubert. Detecting browser fingerprint evolution for identifying unique users. International Journal of Electronic Business, 10(2):120–141, 2012, ISSN 1470-6067, DOI 10.1504/IJEB.2012.051116.

The techniques proposed for classifying user-generated content into Consumer Decision Journey stages and Marketing Mix elements have been published in an international journal indexed by JCR:

Silvia Vázquez, Óscar Muñoz-García, Inés Campanella, Marc Poch, Beatriz Fisas, Nuria Bel, Gloria Andreu. A classification of user-generated content into Consumer Decision Journey stages. Neural Networks, 58:68–81, October 2014, ISSN 0893-6080, DOI 10.1016/j.neunet.2014.05.026.

The technique proposed for detecting emotions has been published in the proceedings of a Spanish conference:

Guadalupe Aguado-de-Cea, María Auxiliadora Barrios, María Socorro Bernardos, Inés Campanella, Elena Montiel-Ponsoda, Óscar Muñoz-García, Víctor Rodríguez. Análisis de sentimientos en un corpus de redes sociales. In Proceedings of the 31st AESLA (Asociación Española de Lingüística Aplicada) International Conference, San Cristóbal de la Laguna, Tenerife, Spain, April 2014.

The techniques proposed for identifying the place of residence and gender of social media users have been published in a Spanish journal:

Óscar Muñoz-García, Jesús Lanchas Sampablo, David Prieto Ruíz. Characterising social media users by gender and place of residence. Procesamiento del Lenguaje Natural, 51:57–64, September 2013, ISSN 1135-5948.

The characterisation of the different kinds of social media according to the morphosyntactic characteristics of the textual content published in such media has been published in the proceedings of an international workshop:

Óscar Muñoz-García, Carlos Navarro. Comparing user generated content published in different social media sources. In Proceedings of the "NLP can u tag #user generated content?! via lrec-conf.org" Workshop, co-located with the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 1–8, Istanbul, Turkey, 26 May 2012.

Finally, the approach that we follow for performing morphological normalisation of social media posts has been published in the proceedings of a Spanish workshop:

Óscar Muñoz-García, Silvia Vázquez Suárez, Nuria Bel. Exploiting Web-based collective knowledge for micropost normalisation.
In Proceedings of the Tweet Normalization Workshop, co-located with the 29th Conference of the Spanish Society for Natural Language Processing (SEPLN 2013), pp. 10–14, Madrid, Spain, 20 September 2013, ISSN 1613-0073.

Chapter 2

STATE OF THE ART

This chapter reviews the State of the Art regarding the objectives of this thesis.

The information published in social media consists of connected data by nature, due to the interlinked nature of social networks. Therefore, graph-based data models are an appropriate way of representing the relationships between the users and contents included in social media. Section 2.1 describes existing semantic vocabularies that can be used for representing social media information. Such vocabularies will be reused in this thesis to provide a normalised schema for structuring the information published in social media.

This thesis provides a technique for unique user identification, which is an essential step for tracking the activity of users in the Web. Section 2.2 describes the existing techniques for tracking users in the Web, while Section 2.3 describes a technique for detecting the evolution of temporary records, upon which our technique for identifying unique users is based.

Additionally, this thesis has a strong business context, and its objectives are devoted to solving specific problems related to the marketing field. Section 2.4 describes the State of the Art on social media analysis applied to market research, while Section 2.5 introduces the marketing background upon which the contributions of our thesis are based. Finally, many of the contributions of this thesis rely on natural language processing techniques applied to the analysis of textual content published in social media, whose State of the Art is described in Section 2.6.

In the following we detail the State of the Art and the existing research problems related to it.

2.1 Semantic Vocabularies for Representing Social Media Information

Social media and the online communities built around them are silos whose users, contents, topics, etc. are rarely connected among them (e.g. Twitter data is not connected with Facebook (http://www.facebook.com) data), except for minor service integrations (e.g. publishing a tweet whenever a status update is made in a LinkedIn (http://www.linkedin.com) account). In addition, there is no unified data format in which to express the information posted to every social medium. For example, the data published using the Facebook Graph API (http://developers.facebook.com/docs/graph-api) does not match that used by the Twitter API (https://dev.twitter.com), nor do either of them match the content syndication formats RSS (http://www.rssboard.org/rss-specification) and Atom [Nottingham and Sayre, 2005], commonly used by weblogs and news publication sites. Format heterogeneity and cross-social-network integration issues hinder data gathering and the integrated analysis of the data published in social media.

SIOC [Breslin et al., 2006] is a Semantic Web ontology designed to cope with these issues. It uses RDF (http://www.w3.org/TR/rdf11-concepts) for representing data published in social media, allowing posts, authors, topics, and other concepts to be linked regardless of the specific social network, therefore providing a mechanism for integrating information related to online communities. The SIOC vocabulary is linked with FOAF [Graves et al., 2007] for representing information about users and user accounts. FOAF defines a data model of persons and relationships between persons, including mappings with other Semantic Web vocabularies, like Schema.org (http://schema.org).
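As a rough illustration of how these vocabularies fit together, the following sketch (written with the rdflib Python library) expresses a post and its author using SIOC, FOAF and Dublin Core terms. The resource IRIs and the post content are placeholders invented for the example; the snippet is not taken from the thesis's implementation.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF, XSD

SIOC = Namespace("http://rdfs.org/sioc/ns#")

g = Graph()
g.bind("sioc", SIOC)
g.bind("foaf", FOAF)
g.bind("dcterms", DCTERMS)

post = URIRef("http://example.org/post/1")         # placeholder IRIs for the example
account = URIRef("http://example.org/user/alice")
person = URIRef("http://example.org/person/alice")

g.add((post, RDF.type, SIOC.Post))
g.add((post, SIOC.content, Literal("Loving my new phone!", lang="en")))
g.add((post, DCTERMS.created, Literal("2014-05-20T10:00:00", datatype=XSD.dateTime)))
g.add((post, SIOC.has_creator, account))
g.add((account, RDF.type, SIOC.UserAccount))
g.add((person, RDF.type, FOAF.Person))
g.add((person, FOAF.account, account))             # links the person to their user account

print(g.serialize(format="turtle"))
```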
Schema.org is a vocabulary designed for marking up HTML (http://www.w3.org/TR/html5) pages to improve indexing and metadata visualisation by search providers like Google (http://www.google.com), Yahoo! (http://www.yahoo.com) and Bing (http://www.bing.com). This vocabulary includes a rich set of classes and properties that can be used to complement the ones provided by SIOC and FOAF for annotating users and contents.

Additionally, SIOC reuses the Dublin Core vocabulary (http://dublincore.org/documents/dcmi-terms) for attaching metadata to posts (e.g. title, summary, publication date) using properties standardised by the DCMI (Dublin Core Metadata Initiative, http://dublincore.org). The SIOC specification (http://rdfs.org/sioc/spec) suggests using SKOS [Miles et al., 2005] for representing the topics according to which contents can be categorised. SKOS is an RDF vocabulary that provides a model for representing conceptual schemes such as thesauri, classification schemes, subject heading lists, taxonomies, and other kinds of controlled vocabularies within the framework of the Semantic Web.

Regarding the geo-localisation of contents and users, FOAF is linked with the WGS84 vocabulary (http://www.w3.org/2003/01/geo), which allows annotating resources with geographical coordinates. In addition, for representing time zones and political regions (e.g. countries and states), the Time Zone ontology (http://www.w3.org/2006/timezone) can be used. Schema.org also provides ontology elements for describing spatial features of web resources.

SIOC provides neither ontology elements nor a recommendation for annotating content with the results of natural language analysis processes. Nevertheless, there exist multiple vocabularies that can be used for performing this task. As an example, the categorisation model ISOcat [Kemps-Snijders et al., 2008] can be used for annotating contents with linguistic information based on a standardised set of categories. With respect to Opinion Mining, Marl [Westerski et al., 2011] is an ontology used for annotating and describing opinions according to the polarity expressed in them with respect to specific entities (e.g. brands, persons) mentioned in social media. Therefore, it provides ontology elements for classifying opinions into three possible categories of polarity (i.e. positive, negative, neutral) and for quantifying such polarity according to a numeric scale. Additionally, the Onyx ontology [Sánchez-Rada and Iglesias, 2013] allows categorising opinions into a broader set of emotions, like the ones described by the Wordnet-Affect taxonomy [Valitutti et al., 2004].
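The following minimal sketch shows how an opinion with its polarity could be attached to a post using Marl. The namespace IRI and the property names are assumptions based on the Marl documentation and should be checked against the version of the vocabulary actually used.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

MARL = Namespace("http://www.gsi.dit.upm.es/ontologies/marl/ns#")  # assumed namespace IRI

g = Graph()
g.bind("marl", MARL)

post = URIRef("http://example.org/post/1")      # the post modelled in the previous sketch
opinion = URIRef("http://example.org/opinion/1")

g.add((post, MARL.hasOpinion, opinion))
g.add((opinion, RDF.type, MARL.Opinion))
g.add((opinion, MARL.hasPolarity, MARL.Positive))                       # positive / negative / neutral
g.add((opinion, MARL.polarityValue, Literal(0.8, datatype=XSD.float)))  # strength on a numeric scale
g.add((opinion, MARL.describesObject, URIRef("http://example.org/brand/acme")))

print(g.serialize(format="turtle"))
```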
Multiple instances of social graphs can be used to perform analyses with different data sets (e.g. for analysing different domains or markets). These instances can be treated separately with RDF named graphs, and each named graph can be described by using graph description metadata, such as that provided by the RDFG vocabulary [Carroll et al., 2005].

Finally, PROV-O (the PROV Ontology, http://www.w3.org/TR/prov-o) provides a set of ontology elements that can be used for representing and exchanging information about the provenance of data generated by different systems. Therefore, it can be used within the social media field for indicating the content-authoring entities and referencing publication sources. PROV-O has been mapped to the Dublin Core vocabulary (http://www.w3.org/TR/prov-dc), which in turn is mapped to FOAF. Thus, expressing social media facts using the FOAF and Dublin Core vocabularies automatically adds provenance information through the existing mappings.

All these vocabularies are rich enough for describing general-purpose social graphs. However, during our survey we have not found vocabularies that allow describing some of the concepts related to the marketing domain that this thesis deals with, which will be explained in the following sections. Nor does there exist a unified model that integrates the different vocabularies.

2.1.1 Conclusions

Open Research Problem 1. While there exist data models for representing information captured from social media, either generic or social-network-specific, there are no schemas that integrate such information with marketing-specific classifications and KPIs (Key Performance Indicators) obtained from the analysis of the content generated by consumers and the activity produced by them in social media. Therefore, the existing vocabularies may be extended with ontology elements that model marketing-related knowledge.
The most significant KPIs depend on counting unique visitors.

Within a Web Analytics context, the data collection process consists in recording the activity generated by users while they interact with a set of websites. Such recorded activity may contain records about advertisement impressions, clicks on web page hyperlinks, and other navigational information. Collected data is useful for a number of marketing activities, such as analysing advertisement campaign outreach or performing behavioural targeting, which involves tracking the online activities of users in order to deliver tailored ads to them. Specifically, ad targeting techniques, such as the one described by Deane et al. [2011], rely on data in which users are uniquely identified.

For collecting such data, firstly the activity itself must be captured. After that, such activity must be associated with unique visitors. Visits and unique visitors are the basic web metrics required for nearly every web metric calculation [Kaushik, 2009]. As defined by the Digital Analytics Association [Burby and Brown, 2007]:

Definition 1. A visit is an interaction, by an individual, with a website consisting of one or more requests for an analyst-definable unit of content (i.e. page view).

Definition 2. The KPI unique visitors refers to the number of inferred individual people (filtered for spiders and robots), within a designated reporting timeframe, with activity consisting of one or more visits to a site. Each individual is counted only once in the unique visitor measure for the reporting period.

At least six of the eight critical web metrics defined by Kaushik [2009] depend on uniquely identifying users (i.e. unique visitors, time on page, time on site, bounce rate, exit rate, and engagement). The other two are visits and conversion rate. Conversion rate can be calculated by taking into account either unique visitors or visits, depending on business objectives.

2.2.1 Techniques for Capturing Web Activity

There are four main ways of capturing the activity (a.k.a. clickstream data) of website users [Kaushik, 2007]: web logs, web beacons, JavaScript tags, and packet sniffing. This section describes these approaches and analyses their advantages and disadvantages.

Figure 2.1: Process followed by the technique based on web logs (adapted from Kaushik [2007])

2.2.1.1 Technique Based on Web Logs

Web logs are a classic system for capturing clickstream data. This technique is implemented by web servers and consists in registering one log entry each time there is a request to a web server by a web client. In such log-based systems, the web server triggers the log action when it receives a request from the client. Figure 2.1 shows the process followed by this technique. The steps of this process are the following:

1. A user requests a resource (e.g. a web page) through its URL [Berners-Lee, 1994].
2. The request is sent to a web server.
3. The server receives the request and creates a record in its log describing the request.
4. Finally, the server sends the resource to the user.

The format of web server logs has been standardised by W3C21. The standard proposes to describe log files as a sequence of log entries preceded by a header with one or more of the metadata described next:

Version. Specifies the version of the log file format used.

Fields. Specifies the fields recorded in the log. Such fields are defined by using a prefix and a field identifier. The prefix refers to the information transfer mode, while the identifier refers to an entry data type. For example, the identifier cs-method refers to the HTTP method [Fielding and Reschke, 2014b] used for data transfer from client to server. Table 2.1 shows the list of available prefixes, while Table 2.2 shows the possible fields that can be registered, indicating whether or not each field requires a prefix to be declared.

Software. Identifies the software that generated the log.

Start-Date. The date and time at which the log was started.

End-Date. The date and time at which the log was finished.

Date. The date and time at which the entry was added.

Remark. Comment information. Analysis tools should ignore data recorded in this field.

21 http://www.w3.org/TR/WD-logfile.html

Prefix  Description
c       Client.
s       Server.
r       Remote.
cs      Client to Server.
sc      Server to Client.
sr      Server to Remote Server. This prefix is used by proxies.
rs      Remote Server to Server. This prefix is used by proxies.
x       Application specific identifier.

Table 2.1: Prefixes that can be declared in a web server log file

Identifier  Prefix  Type     Description
date        No      Date     Date at which transaction completed.
time        No      Time     Time at which transaction completed.
time-taken  No      Fixed    Time taken for transaction to complete in seconds.
bytes       No      Integer  Number of bytes transferred.
cached      No      Integer  Records whether a cache hit occurred.
ip          Yes     Address  IP [Postel, 1981] address and port.
dns         Yes     Name     DNS name [Mockapetris, 1987].
status      Yes     Integer  Status code [Fielding and Reschke, 2014b].
comment     Yes     Text     Comment returned with status code.
method      Yes     Name     HTTP method.
uri         Yes     URI      URI [Berners-Lee et al., 2005].
uri-stem    Yes     URI      Stem portion alone of URI (omitting query).
uri-query   Yes     URI      Query portion alone of URI.

Table 2.2: Identifiers that can be declared in a web server log file

Listing 2.1 shows an example log file that includes a header in which the version used (line 1), the recording date (line 2), and the fields registered (line 3) are specified. The registered fields correspond to the timestamp of each request, the HTTP method used, and the URI of the resource requested.

1 #Version: 1.0
2 #Date: 12-Jan-1996 00:00:00
3 #Fields: time cs-method cs-uri
4 00:34:23 GET /foo/bar.html
5 12:21:16 GET /foo/bar.html
6 12:45:52 GET /foo/bar.html
7 12:57:34 GET /foo/bar.html

Listing 2.1: Example log file

The technique based on logs is the most accessible of all the techniques for recording web activity, since most web servers implement it. Also, there are numerous tools that allow the analysis of logs, such as AWStats22, Webalizer23 and Analog24.

22 http://awstats.sourceforge.net
23 http://www.webalizer.org
24 http://www.analog.cx

The main criticism of this technique is that the information captured in log files is often too technical (HTTP errors [Fielding and Reschke, 2014b], browser types, etc.) to be used directly for business purposes (e.g. marketing intelligence). Similarly, the information recorded in the logs is too large, since it records the download of any resource provided by the web server (style sheets, images, etc.), regardless of whether it is worth measuring or not. Therefore, the log files must be conveniently filtered prior to their analysis.

The technique based on logs is able to register any activity that implies an HTTP request [Fielding and Reschke, 2014a] from the client to the server. However, it is not able to register users' behaviour on web pages when such behaviour does not require a resource download operation. Such operations are becoming more common due to dynamic web pages.
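To make the log format above concrete, the following is a minimal parsing sketch in Python (an illustration written for this chapter, not part of the W3C specification or of the tools cited above); the function name is arbitrary and, for simplicity, it assumes that field values contain no spaces.

def parse_w3c_extended_log(path):
    """Parse a W3C extended log file (such as Listing 2.1) into dictionaries."""
    header = {}      # directives such as Version, Date, Fields
    fields = []      # field identifiers declared in the #Fields directive
    entries = []     # one dictionary per log entry
    with open(path, encoding="utf-8") as log:
        for line in log:
            line = line.strip()
            if not line:
                continue
            if line.startswith("#"):
                directive, _, value = line[1:].partition(":")
                header[directive.strip()] = value.strip()
                if directive.strip().lower() == "fields":
                    fields = value.split()
                continue
            # Simplification: field values are assumed not to contain spaces.
            entries.append(dict(zip(fields, line.split())))
    return header, entries

# With the file of Listing 2.1, the first returned entry would be
# {'time': '00:34:23', 'cs-method': 'GET', 'cs-uri': '/foo/bar.html'}.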
2.2.1.2 Technique Based on Web Beacons The web beacons technique consists in placing banners, or 1 × 1 pixel transparent images, in web pages within img src HTML tags. When these tags are processed, a request to a tracking server is performed, what triggers the recording of the activity. Figure 2.2 shows the process followed by this technique. The steps of this process are the following: 1. A user requests a web page through its URL. 2. The request is sent to a web server. 3. The server sends the web page including an image of 1 × 1 pixels whose URL points to a data collection server. 4. When the web page is loaded in the user’s browser, a request of the image is sent to the data collection server. 5. The data collection server sends the image to the user, taking advantage of the HTTP protocol for managing cookies in the user’s device, and capturing user data, such as the web page that the user is viewing, the IP address of the user’s device, the timestamp of the activity, etc. Web beacons are used not only to capture information relating to the navigation of web pages; they can also can be inserted into email messages, so KPIs about an email sent can be recorded (e.g. number of email views). However, users often disable the download of images within their email applications. 18 4 Data Collector 5 3 2 1 Website Servers Figure 2.2: Process followed by the technique based on web beacons (adapted from Kaushik [2007]) 2.2.1.3 Technique Based on JavaScript Tags The JavaScript tags technique is the most used nowadays, existing multiple commercial tools that implement it (e.g. Adobe Marketing Cloud25 , IBM EMM26 , webtrends27 , and Google Analytics28 ). It consists in placing JavaScript [ECMA, 2011] code within HTML pages, so that, when an event to be measured is produced, the scripting code is evaluated. Such code includes a request to a tracking server. Thus, when the script is evaluated, the request is performed and the activity is recorded. Figure 2.3 shows the process followed by this technique. The steps of this process are the following: 1. A user requests a web page through its URL. 2. The request is sent to a web server. 3. The server sends the web page including a script of JavaScript code assigned to different events (e.g. web page load, click on an active item). 4. When an event is triggered, its assigned JavaScript code is executed. Such code includes sending an HTTP request to a data collection server. 25 http://www.adobe.com/en/solutions/digital-marketing.html http://www.ibm.com/software/products/category/enterprise-marketing-management 27 http://webtrends.com 28 http://www.google.com/analytics 26 19 5. The data collection server processes the request, taking advantage of the HTTP protocol for managing cookies in the user’s device, and capturing user data, such as the web page that the user is viewing, the IP address of the user’s device, the timestamp of the activity, etc. Both, the technique based in web beacons and the technique based in JavasScript tags, allow collecting the web activity produced in multiple websites into a single data collection system. Figure 2.4 illustrates this scenario. 
4 Site Analytics Services 3 2 5 1 Website Servers Figure 2.3: Process followed by the technique based on JavaScript tags (adapted from Kaushik [2007]) 3 4 2 Site Analytics Services 5 Website 1 servers 3 1 Data Collector 2 Website 2 servers Figure 2.4: Process followed by the tags or web beacons techniques for gathering data from multiple sites (adapted from [Kaushik, 2007]) 20 2.2.1.4 Technique Based on Packet Sniffing The packet sniffing technique consists in inspecting IP packages exchanged between web browsers and web servers. Packet sniffers can be implemented as a software layer over the web server, or as an independent module that intercepts and analyses the packages sent by web browsers before re-routing them to web servers. Figure 2.5 shows the process followed by this technique. The steps of this process are the following: 1. A user requests a web page through its URL. 2. The request is intercepted in its route to the web server by a packet sniffer that extracts the request data from the HTTP header of the request. 3. The packet sniffer re-routes the request to the web server. 4. The web server sends its response to the user’s browser. The response is intercepted by the packet sniffer, which extracts the information about the web page being served. Additionally, some sniffers add JavaScript tags to the web page, with the aim of obtaining additional information, once the browser processes the scripts. 5. The packet sniffer re-routes the response to the web browser. 4 5 1 2 Packet Sniffer 3 Website Servers Figure 2.5: Process followed by the technique based on packet sniffing (adapted from Kaushik [2007]) 21 2.2.2 Techniques for Identifying Unique Users This section describes the existing techniques for identifying unique users. Section 2.2.2.1 describes the widely used technique based on cookies, while Section 2.2.2.2 describes a novel technique based on the fingerprint of the device used for browsing the Web. 2.2.2.1 Technique Based on Cookies With respect to the technique for uniquely identifying users, the one based on cookies is the most extended. A cookie is a message sent to a web browser from a web server. The browser stores the message and forwards it to the server each time the web browser requests a page from the server. The web server can send two different kinds of cookies: 1. Session cookies, which have a lifetime limited to the user interaction with the website. 2. Persistent cookies, which remain on the machine of the user until a date of cookie expiration. The second type of cookies is the one used for user identification. Each time a request comes from a web browser to a web server, the server checks if a specific cookie exists on the client. If the cookie exists, the server obtains it and reads a unique user identifier stored on it. If the cookie does not exist, the server generates a new one, with a new unique user identifier, and sends it to the client. Typically, cookies used to identify users contain a user identifier, unique and anonymous, which identifies the browser. Therefore, this type of cookies identifies browsers used by users to access the Web. If a user uses multiple devices, the same user will be identified multiple times as a unique user (once per device). Cookies may be disabled in web browsers, or not supported by certain devices, such as smart TVs, so the user identification technique based on cookies cannot be universally applied. 
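The data collection endpoint targeted by web beacons and JavaScript tags, together with the cookie-based identification just described, can be illustrated with a minimal sketch in Python (the handler name, the port, the cookie name uid and the use of the standard library are assumptions made for this example; they are not taken from any of the tools cited in this chapter). The endpoint returns a 1 × 1 transparent GIF, reads or creates a persistent identifier cookie, and records the request metadata.

import uuid
from http import cookies
from http.server import BaseHTTPRequestHandler, HTTPServer

# 1 x 1 transparent GIF returned to the browser that requested the beacon.
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00!"
         b"\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00"
         b"\x00\x02\x02D\x01\x00;")

class CollectorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read the persistent identifier cookie, or mint a new identifier.
        jar = cookies.SimpleCookie(self.headers.get("Cookie", ""))
        visitor_id = jar["uid"].value if "uid" in jar else uuid.uuid4().hex

        # Record the activity (printed here; a real system would persist it).
        print(visitor_id, self.client_address[0],
              self.headers.get("User-Agent", ""), self.path)

        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.send_header("Set-Cookie", f"uid={visitor_id}; Max-Age=31536000; Path=/")
        self.end_headers()
        self.wfile.write(PIXEL)

if __name__ == "__main__":
    HTTPServer(("", 8000), CollectorHandler).serve_forever()

In a production system the print call would be replaced by persistent storage, and the identifier cookie would be set with the appropriate domain, expiration and security attributes.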
In addition, the browser may be configured to delete cookies periodically, or they can be erased by anti-spyware applications. 22 2.2.2.2 Technique Based on Fingerprint The technique based on fingerprint is an alternative to the technique based on cookies. This technique consists in identifying users from a number of attributes of the web browser or that can be queried through it. These attributes are sent from the web browser to the web server within the headers of each HTTP request, or are available once a page has been loaded in the browser so that attribute values can be sent to the web server using the JavaScript tags technique explained before. Eckersley [2010] demonstrated the effectiveness of this technique by extracting and collecting the fingerprints of 470,161 browsers. After analysing the data Eckersley [2010] obtained the following conclusions: • 83.6% of browsers have a unique fingerprint. • In addition, 94.2% of the browsers with Adobe Flash Player29 or Java Virtual Machine30 installed have a unique fingerprint. This is because, making use of these technologies, more data are available for differentiating one browser from another (e.g. the fonts installed on the system). • The entropy [Shannon, 1948] associated with the distribution of fingerprints is 18.1 bits, which means that, if a browser is taken at random, at most one in 286,777 browsers share the same fingerprint. • However, the fingerprint of each web browser may change quickly. The number of unstable fingerprints was of 37.4% during the period of study. The approaches for implementing user identification based on browser fingerprint are described next [Eckersley, 2010]. Use the fingerprint as a global identifier. The strength of this technique is that, while cookies can be removed, disabled or not supported by certain web browsers or specific devices (e.g. smartphones and set-top boxes), a fingerprint can be always obtained. The weakness of this technique is that changes on the client (e.g. updating the browser version) imply changes on 29 30 http://www.adobe.com/products/flashplayer.html http://www.java.com 23 the fingerprint and, therefore, unique user identification fails, since there exist distinct fingerprints that correspond to the same user. Use the fingerprint along with the IP address assigned to the user. The strength of this approach is that it improves accuracy with respect to using fingerprint as a global identifier, since adding the IP address to the fingerprint increments its entropy. However, the weakness of this approach is that it fails in environments where the IP may change, as occur when using DHCP [Droms, 1997]. Use the fingerprint along with the IP address to regenerate cookies. The strength of this technique is that correspondences between the cookies and the fingerprint of the users are maintained, so fingerprint is used to identify users with a cookie previously assigned, when such cookie is lost due to cookie expiration or deleted by anti-spyware software. Eckersley [2010] proposes to construct the fingerprint from the attributes described next. User-Agent header. This HTTP header contains information about the device used for requesting the web resource, like the browser version, and the operating system installed in such device. Accept header. This HTTP header determines the MIME [Freed and Borenstein, 1996] type of the content expected in a response to a HTTP request. E.g.: • The value text/html indicates that a web page in HTML format is expected. 
• The value image/jpg indicates that an image in JPEG format31 is expected. • The value text/* indicates that plain text is expected. • The value */* indicates that any kind of content is expected. 31 http://www.jpeg.org 24 Accept-Language header. This HTTP header determines the language expected in the response from a set of standard ones defined by Alvestrand [1995]. Accept-Charset header. This HTTP header indicates the charset expected in the response (e.g. UTF-8 [Yergeau, 2003]). Accept-Encoding header. This HTTP header determines the encoding or compression format expected in the response. Frequent values are gzip or deflate. Cookies enabled. Represents the browser’s capability for accepting cookies. This attribute is set to true when the browser responds with cookie values when asked by the web server. Otherwise the attribute is set to false. Installed plugins. This attribute is composed by the names of the plugins installed in the web browser, their versions, and their assigned MIME types. Installed fonts. The fonts installed in the computer where the browser is running. Video. The video resolution and colour depth configured in such computer. Time zone. The time zone of the user. Session Storage. The capability of the browser for storing session data32 through key-value pairs. Local Storage. The capability of the browser for storing local data through key-value pairs. IE Persistence. The capability for persisting data when the user’s browser is Internet Explorer33 . This capability is enabled by modifying XML34 DOM (Document Object Model)35 elements through JavaScript code. 32 http://www.w3.org/TR/webstorage http://windows.microsoft.com/internet-explorer 34 http://www.w3.org/TR/xml11 35 http://www.w3.org/DOM 33 25 The User-Agent and Accept headers are sent via HTTP from the user’s browser to the web server. The rest of attributes are sent to the tracking server by applying the technique based on JavaScript tags explained in Section 2.2.1.3. An advantage of the browser fingerprinting technique is that a thorough selection of fingerprint attributes may lead to cross-browser identification (i.e. assigning users to multiple browsers). Boda et al. [2012] have shown that a subset of browser-independent attributes is enough to uniquely identifying most users. A disadvantage of existing browser fingerprinting techniques is the evolution of fingerprint over time, since the fingerprint makes use of attributes whose value may change. Therefore, the tracking server may interpret that two different fingerprints of the same browser correspond to different browsers. To solve this problem, Eckersley [2010] describes an algorithm for detecting the evolution of the fingerprints. This algorithm consists in measuring the lexical similarity between pairs of different fingerprints. If this similarity exceeds a threshold (θ = 0.85), it is considered that the two fingerprints represent the same user. This algorithm can be significantly improved if different weights are assigned to the fingerprint attributes, according to their importance, or if the time elapsed between fingerprints registration is taken into account. 2.2.3 Conclusions The metric unique visitors measures the audience of a site in terms of people that have accessed site contents. Counting unique visitors of websites is an essential activity in order to perform Web Analytics, since many Web Analytics KPIs depend on individuals counted only once (e.g. new visitors, return visitors, etc.). 
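As a simple illustration of how these two metrics can be derived once individuals are identified, the following sketch counts visits and unique visitors from a list of page-view records; the 30-minute inactivity window used to close a visit is a common convention, and the record layout and function name are assumptions made for this example.

from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # inactivity window that closes a visit

def visits_and_unique_visitors(page_views):
    """page_views: iterable of (visitor_id, timestamp) pairs, e.g.
    ("u1", datetime(2015, 1, 12, 0, 34, 23)). Returns (visits, unique_visitors)."""
    last_seen = {}
    visits = 0
    for visitor_id, ts in sorted(page_views, key=lambda pv: pv[1]):
        previous = last_seen.get(visitor_id)
        if previous is None or ts - previous > SESSION_TIMEOUT:
            visits += 1  # a new visit starts for this visitor
        last_seen[visitor_id] = ts
    return visits, len(last_seen)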
There are many techniques to capture user activity, such as recording server logs, using web bugs or JavaScript tags that make use of HTTP, HTML, and JavaScript capabilities for triggering events that cause the registration of such activity, or inspecting complex low-level network packets exchanged between browsers and web servers.

The techniques most used for uniquely identifying users from captured web activity are the ones that combine cookies and web bugs or JavaScript tags [Harding et al., 2001]. This approach is being affected by several factors, such as strict privacy restrictions implemented by web browsers [Kaushik, 2007] or the use of new devices for navigating the Web that do not support cookies (e.g. many set-top boxes and certain video game consoles). Furthermore, several security programs, such as anti-spyware ones, remove cookies periodically, making it difficult to trace recurring visits to websites [Kaushik, 2007]. Thus, these security measures, enabled to protect the privacy of users, affect basic aggregated metrics obtained with Web Analytics, from which valuable business insights can be derived, such as the number of unique visitors of a website or the bounce rate.

Open Research Problem 3. An alternative to cookies for uniquely identifying users consists in capturing distinctive technical attributes of the system used by such users to navigate the Web (i.e. their browser fingerprint). While Eckersley [2010] demonstrated the effectiveness of this technique, the technique is not entirely accurate, since the browser fingerprint is built from attributes that evolve over time. Thus, changes in the values of fingerprint attributes lead to incorrectly counting new users.

2.3 Technique for Detecting the Evolution of Temporary Records

Li et al. [2011] describe a method for detecting the evolution of temporary records. This method takes into account the time elapsed between the capture of the records being compared, introducing the concept of time decay and defining the probabilities described next.

Definition 3. Disagreement decay is the probability that an entity changes the value of an attribute A within the time Δt. This probability is denoted by d≠(A, Δt) [Li et al., 2011].

Definition 4. Agreement decay is the probability that two different entities share the same value of A within the time Δt. This probability is denoted by d=(A, Δt) [Li et al., 2011].

In addition, Li et al. [2011] describe two algorithms to learn agreement and disagreement decays from existing training data, and different ways of calculating the similarity between two records taking into account the probabilities defined above and the cardinality of the attributes (e.g. single-valued or multi-valued). Finally, three algorithms for clustering temporal records, described next, are provided.

2.3.1 Early Binding Algorithm [Li et al., 2011]

This algorithm processes the records in ascending time order. For each record, the algorithm creates a new cluster, or adds it to an existing cluster. Specifically, given a record r and a set of clusters C1, ..., Cn, the algorithm consists in the execution of the following steps:

1. Calculate the similarity between r and each Ci, i ∈ [1, n].
2. Let sim(r, Cx) be the similarity between r and a cluster Cx; choose the cluster C with the biggest similarity.
(a) If sim(r, C) > θ, add r to C, where θ is a threshold that indicates a high similarity.
(b) Otherwise, create a new cluster Cn+1 for r.
3. Update the signature of the cluster (i.e.
cluster description) to which r has been added.

Given a set of records to be clustered, the computational complexity of this algorithm is O(n²) (i.e. quadratic complexity), because the algorithm compares each pair of records once.

2.3.2 Late Binding Algorithm [Li et al., 2011]

The strength of this algorithm is that, unlike the previous algorithm, in which decisions were made early, this algorithm stores information about all the comparisons between records and clusters and takes the decisions at the end of the process, improving accuracy.

To store the information of the comparisons, the algorithm makes use of a data structure that stores a bipartite graph (Nr, NC, E) in which each node nr represents a record, each node nC represents a cluster, and each edge (nr, nC) ∈ E is labelled with the probability that record r belongs to cluster C. The algorithm is implemented in two phases, called Evidence Collection and Decision Making.

1. The Evidence Collection phase creates the bipartite graph and calculates the weight for each edge. This step behaves in a similar way to the previous algorithm, but storing all the probabilities instead of taking early decisions.
2. The Decision Making phase deletes edges with lower weights until each record r belongs to a unique cluster C.

The weakness of this algorithm is that it adds a further analysis phase, which increases processing time in comparison to early binding, which runs in a single phase. In addition, early binding has lower memory usage requirements than late binding, as for each cluster the early binding algorithm maintains only the last record that was added. In contrast, the late binding algorithm maintains all records within the cluster as the cluster signature. The computational complexity of the late binding algorithm is also O(n²).

2.3.3 Adjusted Binding Algorithm [Li et al., 2011]

The strength of this algorithm is that, unlike the previous algorithms, it allows comparing records with clusters created after the arrival of any record, improving accuracy over the previous algorithms. This algorithm starts after executing any of the previous algorithms, and consists in the execution of the following steps:

1. Initialisation. Set the initial assignment as the result of early or late binding.
2. Estimation. Compute the similarity of each record-cluster pair as it is done in the first step of late binding.
3. Maximisation. Choose the clustering with the maximum probability as in step 2 of late binding.
4. Termination. Repeat steps 2-3 until the results converge or oscillate.

The weakness of this algorithm is that it adds additional steps of quadratic computational complexity (O(n²)) that have to be executed after running early binding or late binding. Thus, the number of iterations to run over the data makes this algorithm less scalable than the other ones.

2.3.4 Conclusions

One of the objectives of this thesis is to study the feasibility of a novel browser identification technique in a real-time scenario, where the tracking server assigns fingerprints to particular users as they arrive at the system. Of the three algorithms described before, the most suitable for this scenario is early binding, due to the reasons explained next.

• The adjusted binding approach is discarded due to the scalability reasons explained before.
• In addition, in a real-time scenario there is always a set of zero or more clusters created previously, and only one record to classify on each invocation of the algorithm, so the computational complexity of early binding and late binding is reduced to O(n) (i.e. linear complexity).

Therefore, the early binding algorithm is the most suitable for achieving the objective of this research.

2.4 Social Media Analysis Applied to Market Research

The Internet has transformed the way in which consumers' word-of-mouth (i.e. the informal exchange of information between at least two individuals, which is perceived as trustworthy) is created and propagated [De Bruyn and Lilien, 2008; Gupta and Harris, 2010; Kozinets et al., 2010]. Digitised customer feedback information (i.e. electronic word-of-mouth or e-WOM) can be accessed any time and anywhere through diverse social media such as blogs, social networks, customer reviews, and forums, which further increases its influence among fellow customers [Dellarocas, 2003; Schindler and Bickart, 2005]. Nowadays, a person who is looking for information about some product is not limited to asking friends or relatives about it; instead, he or she can expand this search by consulting user reviews, specialised blogs, or even brief opinions stated by microbloggers. According to a survey by Nielsen [2012a], 70% of global consumers trust buyers' reviews, while 92% of consumers indicate they trust recommendations from peers, family and word-of-mouth above other forms of advertising. This shopping scenario, though disruptive for traditional business models, opens up opportunities for corporations to grow, innovate and improve their relationship with customers [Hennig-Thurau et al., 2010].

Marketers are in an advantageous position to monitor and derive a benefit from this unparalleled volume of consumer conversations, which are increasingly taking place in social media channels. Accordingly, companies have reorganised their traditional methods of gathering customer opinions (such as polls and surveys) in order to adapt them to these new media. This novel source of consumer data is not only extremely massive and complex but also completely unfiltered, which facilitates a real-time, deeper comprehension of consumers' needs and thoughts [Han et al., 2014]. This in turn improves the level of responsiveness to reputation crises, emergencies and similar situations.

However, although the proliferation of social media has allowed organisations and companies to collect a massive amount of information about users' opinions, the majority of this user-generated content is unstructured and, therefore, hard to interpret, classify and summarise. In order to meet these new requirements, fields such as Sentiment Analysis and Opinion Mining [Liu, 2012] have developed technology to automatically analyse user-generated content. Research in these areas started to work on several aspects, such as subjectivity detection, automatic classification of opinionated texts, and automatic opinion summarisation. At the beginning, the main objective of these fields was limited to summarising the overall opinion expressed in these user-generated texts, generally based on the distinction between positive and negative comments conveyed by buyers. However, the task started to evolve [Cambria et al., 2013; Cambria and White, 2014] and currently there is a broader interest in carrying out a very fine-grained analysis of the available data [Gangemi et al., 2014].
The content of the user-generated texts is so rich and varied that it can be analysed from very different perspectives. For example, in works such as Asur and Huberman [2010]; Joshi et al. [2010]; Sadikov et al. [2009] authors make predictions about the profit of movies from user-generated content of microblogs, reviews and blogs. However, the validity of social metrics [Sterne, 2010] depends to a large extent on the population over which they are applied. Social media users cannot be considered a representative sample until the vast majority of people regularly use social media. Therefore, until then, it is necessary to identify the different strata of users in terms of socio-demographic attributes (e.g. gender, age or geographical precedence) in order to weight their opinions according to the proportion of each stratum in the population [Gayo-Avello, 2011]. As an example, the comparison performed by Mislove et al. [2011] between the U.S. and Twitter populations along three axes (place of residence, gender and race) showed that Twitter users significantly overrepresent the densely population regions of the U.S., are predominantly male, and represent a highly non-random sample of the overall race/ethnicity distribution. 2.4.1 KPIs Based on Social Media Analysis In the world of marketing and business, predicting real-world outcomes is a challenging task that normally requires indicators from heterogeneous data sources. For instance, traditional media content analysis has been used to forecast the financial market [Chan, 2003; Fung et al., 2003; Tetlock et al., 2008], and several works have demonstrated connections between online content and customer behaviour (e.g. purchase decisions). Since social media feeds can be effective indicators of real-world performance [Asur and Huberman, 2010], different forecasting models have been studied for using online chatter to predict real world outcomes related to the sales of different kinds of goods, such as movies [Asur and Huberman, 2010; Mishne and Glance, 2006; Zhang and Skiena, 2009] or books [Gruhl et al., 2005]. Predictive models range from gross income predictions [Asur and Huberman, 32 2010; Joshi et al., 2010; Mishne and Glance, 2006; Sharda and Delen, 2006; Zhang and Skiena, 2009] to revenue estimations per product distributor (i.e. stores that offer a product or service) [Mishne and Glance, 2006] or spike predictions in sales ranks [Gruhl et al., 2005]. Besides, social media plays an increasingly important role in how customers discover and engage with various forms of content, including traditional media, such as TV. In this line, a study by Nielsen [Subramanyam, 2011] found correlations between online buzz and TV ratings. Many social media have started to be exploited to obtain the indicators that enable such prediction models (e.g. from Twitter [Asur and Huberman, 2010], blog feeds [Gruhl et al., 2005; Mishne and Glance, 2006], review texts [Joshi et al., 2010], online news [Zhang and Skiena, 2009]). Indicators are based on volume, sentiment analysis, or combinations between them and economic data or product metadata. Volume-based indicators can be simple or composed. Among the simple predictors we find the raw count of posts referring to a brand [Gruhl et al., 2005; Mishne and Glance, 2006; Zhang and Skiena, 2009], the number of mentions for a brand (i.e. 
count of entity references, taking into account that one post can mention the same entity multiple times) [Zhang and Skiena, 2009], or the number of unique authors that refer to the brand. Among composed predictors we find the post rate [Asur and Huberman, 2010] (which denotes the rate at which publications about particular topics are created, i.e. the number of posts about a topic divided by time) and the post-per-source (which measures the average number of posts published about a topic in particular feed sources, e.g. a set of forums). These volume-based indicators have been demonstrated to be effective. For example, spikes in references to books in blogs are likely to be followed by spikes in their sales [Gruhl et al., 2005]. Sentiment analysis-based indicators are based on the hypothesis that products that are talked about positively will produce better results than those discussed negatively, because positive and negative opinions influence people as they propagate through a social network. Basic sentiment-based predictors include the numbers of positive, negative and non-neutral posts (i.e. positive plus negative) about a brand [Mishne and Glance, 2006]. Composite indicators include the positive and negative ratios [Zhang and Skiena, 2009] (i.e. the number of positive 33 or negative posts divided by the total number of posts), and the mean or the variance of sentiment values [Mishne and Glance, 2006]. Other important composite sentiment-based indicators include the Net Promoter ScoreSM (NPS36 ), the polarity index and the subjectivity index. NPS is commonly used to gauge the loyalty of a firm’s customer relationships [Zhang and Skiena, 2009]. NPS can be approximated by dividing the difference of positive and negative posts by the total number of posts. The polarity index is calculated in different manners: by dividing the posts with positive sentiment by the post with negative sentiment [Asur and Huberman, 2010; Mishne and Glance, 2006], or by dividing the posts with positive sentiment by the number of non-neutral posts [Zhang and Skiena, 2009]. Subjectivity is measured by dividing the number of non-neutral posts by the number of neutral or total publications [Zhang and Skiena, 2009]. Low-level textual feature-based indicators, combined with metadata features, have been also demonstrated to achieve a good performance [Joshi et al., 2010]. Such textual features include term n-grams, part-of-speech n-grams and dependency relations. All these indicators can be combined with other numerical and categorical predictors, such as product metadata [Joshi et al., 2010; Mishne and Glance, 2006; Sharda and Delen, 2006; Zhang and Skiena, 2009], advertising investment, overall budget [Joshi et al., 2010; Zhang and Skiena, 2009], number of product distributors [Mishne and Glance, 2006; Zhang and Skiena, 2009], or even, the Time Value of Money [Zhang and Skiena, 2009]. The forecasting models used range from linear or logistic regression models [Asur and Huberman, 2010; Joshi et al., 2010; Zhang and Skiena, 2009] to knearest neighbour models (k-NN) [Zhang and Skiena, 2009]. Gruhl et al. [2005] base their models on time-series analysis and construct a moving average predictor [Box and Jenkins, 1990], a weighted least squares predictor, and a Markov predictor. 
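To make the composite sentiment-based indicators described above concrete, the following short sketch computes the NPS approximation, one variant of the polarity index, and the subjectivity index from the counts of positive, negative and neutral posts about a brand; the function and variable names are illustrative and are not taken from the cited works.

def sentiment_indicators(positive, negative, neutral):
    """Composite sentiment-based indicators derived from post counts."""
    total = positive + negative + neutral
    non_neutral = positive + negative
    return {
        # NPS approximation: (positive - negative) divided by the total number of posts
        "nps": (positive - negative) / total if total else 0.0,
        # Polarity index (one of its variants): positive posts divided by negative posts
        "polarity": positive / negative if negative else float("inf"),
        # Subjectivity: non-neutral posts divided by the total number of posts
        "subjectivity": non_neutral / total if total else 0.0,
    }

# Example: 120 positive, 30 negative and 50 neutral posts about a brand
# yield nps = 0.45, polarity = 4.0 and subjectivity = 0.75.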
Sharda and Delen [2006] convert the forecasting problem into a classification problem by discretising the continuous predicted variables to a finite number of categories, and then they use a neural network model for performing the classification. Finally, the scale of the data is a key aspect when analysing online content. To 36 Service mark owned by Bain & Company (http://www.netpromotersystem.com) 34 get an idea, the work presented by Asur and Huberman [2010] uses 2.98 million tweets from 1.2 million users, with feeds extracted hourly during three months; the Nielsen study about social TV uses data from 250 TV programs and 150 million social media sites; and in Gruhl et al. [2005] the authors analyse the daily rank values of 2,340 books over a period of four months. 2.4.2 Conclusions The proliferation of new social media channels provides marketing practitioners with a huge quantity of data about consumer preferences, likes and dislikes. The large amount of data provides more and richer information that is, however, lost because of the lack of means if it is to be analysed by using manual methods. In comparison with traditional quantitative techniques such as questionnaires, the collection of opinions extracted from social media sources means less intrusion since it enables the gathering of spontaneous perceptions and desires of consumers, without introducing any bias. In addition, the possibility of doing this in real time poses a clear advantage over other techniques based on retrospective data. Overall, this allows for a more efficient and complex business decision making based on a comprehensive assessment of users propensity to buy and concrete opinions shared about a brand or product. Open Research Problem 4. While there are approaches for obtaining KPIs derived from the volume of posts about the opinionated entities, or the polarity of opinion about them, there are other KPIs that cannot be obtained due to the lack of user-generated-content-analysis techniques that allow to classify consumers according to multiple socio-demographic and psychographic attributes commonly used in the field of marketing for consumer segmentation. The next section describes the marketing and psychological backgrounds upon which the set of socio-demographic and psychographic attributes are based. 2.5 Marketing Background Marketing is the process of communicating the value of a product or service to consumers for the purpose of selling that product or service to them. If marketing 35 has one goal, it is to understand the most adequate way to reach consumers to offer them the product or service recommended for them. To that extent, it is important to get familiarised with the various buying processes that consumers go through depending on the product at hand. Furthermore, what is considered in fact of great value is being able to detect the different stages that consumers have to go through during this process, as well as the conditioning factors that produce a shift from one stage to another. In the past, the construction of the media plan for a media agency was far less complicated as there were fewer media, i.e. TV, printed newspapers, etc. Back then, placing an advert in television would guarantee the delivery of the marketing message to the consumer. However, nowadays the task of reaching the consumer is not that straightforward anymore due to the fragmentation of both traditional and digital media. 
Marketing teams today are swimming in data —online, offline, internal, external, customer demographics, Web Analytics, media modelling, visibility, impressions, click-through rates, conversions, engagement metrics (see [Burby and Brown, 2007] for some examples). The most important thing to remember is that all that the brand teams really want is to connect with its customers, or potential customers, in a personal and meaningful way. The goal for marketers today is first to tie all their disparate proprietary data together. But that’s only step one. To send appropriate messages to receptive consumers, brands need to be able to identify and segment customers and prospective customers using predictive attributes: What are they likely to buy? How are they thinking? And what is the best way to reach them? To optimise media spending, marketers also need to look for solutions that effectively manage their campaigns and divide consumers into psychographic and demographic clusters —a way for marketers and their agencies to overlay proprietary data and look for the right targets based on who they are, what they have done, what they like and what they’re likely to buy. Thus consumers are benefited with pertinent and meaningful communications directed by the brands, which take into account their context, preferences and particular needs, avoiding the over-saturation of massive marketing. There is nothing worst to a customer than receiving “junk” advertisement on something that they do not need, want 36 or that they already have. This section presents the theoretical marketing backgrounds related to the work presented in this thesis. We introduce the Consumer Decision Journey [Court et al., 2009] and Marketing Mix [Borden, 1964] models, as well as a summary of psychological research on human emotions, which are conceptual frameworks upon which the analytic tools we propose are based on. Additionally, we describe the different kind of media that marketers must deal with nowadays and describe the different kinds of tools used for solving the problems arisen on each media type. 2.5.1 The Consumer Decision Journey The Purchase Funnel, proposed in the early twentieth century by Lewis [1903], is a marketing model that illustrates the purchase process in several stages, from the moment when a customer is aware of the existence of the product (awareness) to the moment when he or she buys the product (purchase). The model evolved during the last years and at present there are many different purchase funnel models, some of them with many different intermediate stages. However, the basic conceptual framework and stages remain the same in all of them [De Bruyn and Lilien, 2008; Franzen and Goessens, 1999]. Modern versions of the purchase funnel model take into account the influence of Internet and social media in the decision-making path of the customer, and also include a postpurchase stage. The version of the purchase funnel proposed by Forrester [Noble et al., 2010] is a good example of the introduction of the new technologies and social media to the classic Elmo Lewis’ model [Lewis, 1903]. This work highlights the great influence of user-generated content on the final purchase decision of the customers. 
In the model proposed by McKinsey [Court et al., 2009], the Consumer Decision Journey, the traditional funnel shape of the decision journey is transformed into a purchasing loop, and the notion of trigger (the cause that makes potential customers start to investigate the brand and therefore enter the purchase funnel) is introduced.

Knowing the exact stage of the decision journey where the customer is located is essential in order to design specific promotional campaigns, interact with customers at the appropriate touch-points and improve customer relationship management (CRM) systems [Edelman, 2010]. To discover this, the analysis of the different social media channels is crucial, since the online conversations between potential customers play a very important role in the purchase decision pathway [Divol et al., 2012]. The findings of Ng and Hill [2009] and Gupta and Harris [2010] revealed that consumers do actively search the Web for opinions free of commercial bias prior to making a purchase decision. Pookulangara and Koesler [2011] state that, in addition to transforming the evaluation and purchase stages, online social networks enable consumers to become advocates of their preferred brands. Related work by other researchers found that online consumer conversations influence purchase decisions in a variety of ways, which include the reinforcement of product involvement [Wang et al., 2012]. De Bruyn and Lilien [2008] studied which factors affect consumers in the various phases of their online decision-making processes, and found that while tie strength (i.e. the closeness of the relationship between two individuals) facilitates awareness, it has no apparent power over triggering interest or the decision to buy. In summary, it is safe to say that social media have drastically changed the shopping experience, which calls for further research in this area.

While the shopping experience of some goods involves very little deliberation and an emotional response (e.g. greeting cards), other products require deeper forethought, either because their cost is significantly higher or because the consequences of making a good or bad decision are much more profound (e.g. life insurance, mortgages) [Vaughn, 1986]. Similarly, the duration and intensity of the different purchase phases might be affected by the features of the product being purchased or evaluated (e.g. novelty, price) as well as by buyers' characteristics (e.g. their previous experience with the brand) [van Bruggen et al., 2010].

In this work we adopt the following, widely agreed, purchase stages: awareness, evaluation, purchase, and post-purchase experience. This straightforward model can be easily applied to a wide variety of products and purchase contexts. Therefore, our aim is to use a consumer decision-making model whose basic stages are reasonably traceable in a big data scenario consisting of online consumer texts, rather than using a sophisticated conceptual model that incorporates customer experience complexity to its fullest. Figure 2.6 illustrates the model adopted as conceptual framework in this work.

Figure 2.6: Consumer Decision Journey stages adopted in this thesis (awareness, evaluation, purchase, post-purchase experience)

The first stage, awareness, refers to the very first contact of the customer with the product or brand, with or without a desire to purchase. Customers usually convey their interest through references or expressions about the advertising campaigns.
In the evaluation phase, the customer already knows the product or brand and evaluates it, frequently with respect to other similar products or brands. In this step, buyers actively investigate the brand in comparison with its competitors (asking for opinions, formulating questions, consulting product reviews, etc.) and/or express their preference towards a specific brand or product. In the purchase stage, customers either explicitly convey their decision to buy the product or make comments referring to the transaction involved when buying the item. Finally, the post-purchase experience phase refers to the moment when customers, having tried the product, criticise it, recommend it or simply talk about their personal experience with it.

2.5.2 The Marketing Mix

The concept of "Marketing Mix" was coined by Borden [1964], who identified twelve marketing elements to manage business operations in a more profitable way. McCarthy and Brogowicz [1981] reduced these twelve elements to just four: Product, Price, Promotion, and Place (the "4P's"). These four elements usually imply different subcategories that can vary depending on the interests of the marketing company. For example, the element Product could be subdivided into Quality, Design and Warranty; within Place one could distinguish Point of Sale and Customer Service; and Promotion also has different subcategories, such as Sponsorship, Loyalty Marketing, and Advertisement (which can also be divided into different subtypes of advertisement depending on the media used).

The 4P's Marketing Mix framework is used by marketers from all over the world, who take it as a basis to develop their operational marketing plans. Table 2.3 identifies the subcategories in which we have divided each element of the Marketing Mix framework. In this thesis, we have developed classifiers for the following subcategories: "quality", "design", "point of sale", "customer service", "price", "promotion", "sponsorship" and "advertisement".

Product: Quality, Design, Warranty
Place: Point of Sale, Customer Service
Price: Price
Promotion: Promotion, Sponsorship, Loyalty Marketing, Advertisement

Table 2.3: Subcategories of the Marketing Mix elements

2.5.3 Research on Human Emotions

Sentiment studies have been present in different areas and for different purposes. Many researchers have pursued different approaches to analyse human emotions, feelings, opinions, preferences and evaluations, and, unfortunately, there is no agreement on the nature and number of basic human emotions. From the psychology field, we can distinguish two main traditions [Gendron and Feldman Barrett, 2009]:

1. the basic emotion tradition, founded on the study of the basic and instinctive emotions, mainly with an evolutionary approach, and
2. the appraisal tradition, focused on the individual evaluation of world objects.

Within the first approach, we find the works of Plutchik [1989] and Ekman [2005], among others. Plutchik proposed a taxonomy of eight multidimensional emotions grouped into four categories, namely, joy-sadness, trust-disgust, fear-anger, and surprise-anticipation; whereas Ekman differentiated six primary universal (innate and cross-cultural) emotions, which can be recognised from facial expressions: happiness, sadness, anger, disgust, surprise, and fear. One of the main representatives of the second tradition is Arnold [1960], who created a classification of eleven primary emotions (anger, aversion, courage, dejection, desire, despair, fear, hate, hope, love, sadness).
Following also the appraisal tradition, but applying the prototype approach [Rosch, 1978], we find the work of Shaver et al. [1987], who distinguished six primary emotions (love, joy, anger, sadness, fear, and perhaps, surprise) with (related) groups of descriptors drawn from a lexicon of words with emotional connotation (for instance, nervousness and anxiety as descriptors of fear). A comprehensive definition of emotion that comprises all these approaches in this field is given by [Kleinginna and Kleinginna, 1981]: Emotion is a complex set of interactions among subjective and objective factors, mediated by neural-hormonal systems, which can 1. give rise to affective experiences such as feelings of arousal, pleasure/displeasure; 2. generate cognitive processes such as emotionally relevant perceptual effects, appraisals, labelling processes; 3. activate widespread physiological adjustments to the arousing conditions; and 4. lead to behaviour that is often, but not always, expressive, goal directed, and adaptive. Since emotions are affected by the context in which they are produced [Phillips and Baumgartner, 2002], the taxonomies proposed in the psychological domain were adapted for consumption-related studies, the field in which we are interested. In this sense, Richins [1997] elaborated the Consumption Emotions Set (CES) taxonomy, which distinguished between emotions and mood, and grouped emotions into sixteen clusters (e.g. fear: scared, afraid and panicky). In the same line, Westbrook and Oliver [1991] showed that affective experiences (which can be understood here as emotions) coexisted and were related to consumer satisfaction 41 and dissatisfaction, which is the traditional approach used to measure consumer experiences. In line with the Artificial Intelligence studies, Ortony et al. [1990] proceeded on the assumption that progress in psychological research on emotion could be attained through an analysis of the cognitions that underlie emotions. To this end, their account of emotions is in terms of classes of emotions types, and not in terms of specific words. An important guiding principle in developing the theory was that it could be sufficient to permit empirical testing, such as computationally tractable model of emotions to be used in Artificial Intelligence. Obviously, this perspective is also very relevant for our work. In this thesis we have established the categories of sentiments shown in Table 2.4 as our conceptual framework. This conceptual framework is based on Ekman [2005]; Richins [1997]; Shaver et al. [1987], and consists of the following four polarized categories: SD (satisfaction-dissatisfaction), TF (trust-fear), LH (love-hate) and HS (happiness-sadness), where the first one, SD, subsumes the other three (i.e. a text classified as TF, LH or HS is also categorized as SD). This decision is based on previous works (e.g. Oliver [1989]; Westbrook and Oliver [1991]) that confirm that the satisfaction-dissatisfaction scale conceals much more fine-grained sentiments. Finally, Table 2.5 shows the relationship between our conceptual framework and the Wordnet-Affect taxonomy [Valitutti et al., 2004], already introduced in Section 2.1, meaning that a given category of our conceptual framework subsumes the corresponding set of categories in the Wordnet-Affect taxonomy. 
Category  Polarity +     Polarity −
SD        satisfaction   dissatisfaction
TF        trust          fear
HS        happiness      sadness
LH        love           hate

Table 2.4: Categories for the sentiment classification, organised according to their polarity

Category         Wordnet-Affect
Satisfaction     Liking, Gratitude, Positive expectation, Calmness, Affection, Contentment.
Dissatisfaction  Dislike, Annoyance.
Happiness        Self pride, Joy.
Sadness          Shame, Anxiety, Sadness.
Love             Love.
Hate             Hate, Indignation, Bad temper, Fury, Huffiness, Dander.
Trust            Positive hope, Fearlessness.
Fear             Negative Fear.

Table 2.5: Relations between the conceptual framework of emotions used in this thesis and the Wordnet-Affect taxonomy

2.5.4 Owned, Paid and Earned Media

Marketers distinguish three types of media: owned, paid, and earned [Corcoran, 2009].

• Owned media refers to those media controlled by brands, such as their websites, mobile apps, blogs and any communication channel that brands may have on social media platforms like Twitter, Facebook or Instagram37, to mention just a few. The role of this media is to build longer-term relationships with existing customers.

• Paid media refers to the media that brands pay to leverage a channel. It includes traditional offline mass media channels (e.g. TV, radio, print and out-of-home advertising, sponsorships), as well as online channels like display ads and paid search.

• Earned media refers to the opinions about brands exchanged between consumers, and to the sharing of brands' contents through word-of-mouth mechanisms. The content published in social media is mostly of this kind.

37 http://instagram.com

Brands must listen carefully to what happens in all these channels, as if they were customers. Companies struggle to integrate and analyse the huge volume of interactions coming from paid, owned and earned media, with the aim of achieving a holistic 360° approach to brand communication that will lead to more efficient and effective marketing campaigns.

2.5.5 Marketing Technology

In the online marketing field, Big Data Analytics is a big challenge that companies and agencies are facing with applications that address the different brand-customer communication dimensions individually. Such applications are described next.

Programmatic advertising. These systems are oriented to automating the process of paid-media planning (i.e. the buying of advertisement spaces), performing Big Data analysis to find ad placement plans that should lead to optimum performance KPIs (e.g. maximising the click-through rate of display advertising). Demand-Side Platforms (DSPs) like MediaMath38, or Data Management Platforms (DMPs) like Oracle Bluekai39, belong to this category. The scope of these applications is limited to sites with web advertising capabilities.

Site analytics and digital customer experience management. These systems are devoted to analysing and optimising brand-customer communication processes on owned digital media (i.e. sites owned by the brand). Within this group we find the following kinds of applications: Web Analytics applications (e.g. Adobe Marketing Cloud, IBM EMM, webtrends, and Google Analytics) and solutions for digital customer experience management and customer behaviour analysis (e.g. IBM Tealeaf40). The scope of these applications is generally limited to brands' sites and microsites. Recently, services like Google Analytics have extended measurement capabilities to mobile apps.

Social media analytics and social CRM. Within these systems we find applications for measuring brand reputation on earned media (i.e.
media not controlled by the brand, like social networks, Web 2.0, etc.) and applications for social CRM (i.e. community management in social networks).

Regarding social media monitoring applications, given the massive amount of posts published every day through different social media, having a system able to evaluate the global sentiment towards an entity (e.g. a brand or product) is becoming a must for marketing experts. This is one of the main reasons for the increased attention that sentiment analysis has received in the last few years. Actually, there are already several commercial tools able to provide a polarity figure measuring the attitude towards a brand or any other queried topic, such as Radian6 (http://www.salesforcemarketingcloud.com), Sysomos (http://www.sysomos.com) and Brandwatch (http://www.brandwatch.com). Market analysts and social media researchers in general use these tools and other similar ones to classify opinions about brand sentiments in terms of polarity (positive or negative). The State of the Art regarding techniques for sentiment analysis is described in Section 2.6.3.

Social CRM applications implement features for monitoring social media opinions and conversations, and for communicating with consumers using the same social networks where the opinions have been captured. Example applications of this kind are HootSuite (http://hootsuite.com) and TweetDeck (http://tweetdeck.twitter.com).

2.5.6 Conclusions

There are tons of data related to advertising and communication activities that are underexploited, much of them currently in formats that cannot be treated, processed or used. Companies are sitting on "gold mines" without even realising it, and the power of data utilisation is beyond measure.

The first step to influence social media conversations is to understand them to their fullest. In other words, managers and marketers need to know and understand the content of these conversations and, further, be able to classify them into categories that are relevant for their day-to-day tasks, such as Consumer Decision Journey stages and Marketing Mix elements. In the first case (purchase funnel stages), to monitor in real time and react accordingly to the experiences and needs that customers are sharing, advertisers must know in which purchase stages consumers are gained and lost, in order to refine touch points, impact consumers and achieve the desired result (e.g. a transaction). Other applications are, among others, analysing the shopping behaviour of users in comparison with rival brands, confirming whether a particular marketing strategy has had the desired effect on purchase attitudes (e.g. whether there has been a rise in awareness after the launch of an advertising campaign), or exploring whether the distribution of users in Consumer Decision Journey stages is seasonally affected. In the second case (Marketing Mix elements), uncovering the exact content of the dialogues that customers are having, e.g. which product attributes worry them the most, lets marketers and advertisers keep better track of consumers' mind-set.
The combination of these two categories (purchase funnel stages and Marketing Mix elements) gives answers to extremely significant questions that have an influence on the position of the brand in the market, such as: which features a brand is known for, which elements are driving awareness to the brand (e.g. price), which characteristics of the product make it desirable, and which characteristics are not relevant.

Open Research Problem 5. While there are tools for analysing brand health in earned media through the analysis of the polarity of the opinions produced by consumers when talking about the brand, there are no approaches that specifically address the classification of electronic word-of-mouth according to the Consumer Decision Journey, useful for market analysis purposes.

Open Research Problem 6. Additionally, there are no tools for identifying the Marketing Mix elements consumers are referring to when publishing opinions about brands in social media.

2.6 Analysis of Social Media Content

This section describes existing activities and techniques for the analysis of the textual contents published in social media that are related to the contributions of this thesis. Specifically, we describe the lemmatisation and part-of-speech tagging tasks [Jurafsky and Martin, 2009] and introduce content normalisation approaches [Alegria et al., 2013; Sproat et al., 2001], which are fundamental preliminary steps in all the techniques provided by this thesis. After that, we describe the related work regarding sentiment analysis [Liu, 2012] and discuss existing research results on the automatic identification of wishful sentences [Goldberg et al., 2009], which are the areas where we have found more similarities with our work, both in terms of objectives and technologies used. Finally, we describe the existing techniques for detecting the gender and place of residence of social media users, upon which our techniques for recognising socio-demographic attributes are based.

2.6.1 Lemmatisation and Part-Of-Speech Tagging

Many content-analysis techniques rely on particular Natural Language Processing tools to lemmatise (i.e. to group together different inflected forms of a word in order to process them as one single element) and to add morphological information (i.e. part-of-speech, to distinguish between homographs such as "walk-verb" or "walk-noun", verb tense, and person). Thus, a text such as "This Volkswagen I got my eye on is so sexy" gets the representation shown in Table 2.6, where the first column shows the words in the text, the second column shows the lemma corresponding to each word, and the third column the part-of-speech tag, where DT means determiner, NN means common noun singular, NNP means proper noun singular, PRP means personal pronoun, VBD means verb in past tense, IN means preposition, VBZ means verb in present tense in third person singular, RB refers to adverb, and JJ to adjective.

Word         Lemma        Part-Of-Speech
This         this         DT
Volkswagen   volkswagen   NNP
I            i            PRP
got          get          VBD
my           my           PRP
eye          eye          NN
on           on           IN
is           be           VBZ
so           so           RB
sexy         sexy         JJ

Table 2.6: Lemmatisation and part-of-speech tagging of an example text

Example tools for part-of-speech tagging and lemmatisation are Freeling [Padró and Stanilovsky, 2012] and TreeTagger [Schmid, 1994]. Such tools usually make use of standardised vocabularies of tags (e.g. Santorini [1991] defines a tag-set for English and Leech and Wilson [1996] define a tag-set normally used for the Spanish language).
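As an illustration of this preliminary step, the following sketch reproduces a Table 2.6-style analysis using NLTK as a stand-in for Freeling or TreeTagger (the tools actually used in this thesis); the NLTK resources punkt, averaged_perceptron_tagger and wordnet must be downloaded beforehand.

```python
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
text = "This Volkswagen I got my eye on is so sexy"

tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)  # Penn Treebank tags, as in Table 2.6

for word, tag in tagged:
    # WordNetLemmatizer needs a coarse category; map verbs to 'v', the rest to 'n'
    coarse = "v" if tag.startswith("VB") else "n"
    lemma = lemmatizer.lemmatize(word.lower(), pos=coarse)
    print(f"{word}\t{lemma}\t{tag}")
```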
Generally, such tools provide more features beyond lemmatisation and part-of-speech tagging. As an example, Freeling is an open-source multilingual language processing library providing a wide range of analysers for several languages, including named entity detection and classification, dependency parsing and nominal co-reference resolution, among others.

2.6.2 Normalisation of Microposts

The activity of normalising user-generated content is a crucial step before analysing social media posts, particularly on Twitter. User-generated content published in social media (especially in microblogs) is characterised by informality, brevity, frequent grammatical errors and misspellings, and by the use of abbreviations, acronyms, and emoticons. These features add additional difficulties to text mining processes, which frequently make use of tools designed for dealing with texts that conform to the canons of standard grammar and spelling [Hovi et al., 2013]. The micropost normalisation activity enhances the accuracy of NLP tools when applied to short fragments of text published in social media, e.g. the syntactic normalisation of tweets improves the accuracy of existing part-of-speech taggers [Codina and Atserias, 2012]. There are several techniques that can be combined for micropost normalisation, which are described next.

1. Pre-processing the micropost for detecting, removing and transforming specific social network metalanguage elements (e.g. hashtags, user names, URLs) into standard language constructions; e.g. Kaufmann and Jugal [2010] propose several rules for dealing with hashtags and user names.

2. Performing orthographic correction of the content by relying on lexical resources like SMS lexicons for identifying abbreviations. Lists of correct forms are also used for performing spell correction, e.g. Gamallo et al. [2013] rely on a list of correct forms in Spanish generated by an automatic conjugator from the lemmas found in the Real Academia Española Dictionary (DRAE, http://www.rae.es/recursos/diccionarios/drae).

As an example result of the micropost normalisation task, the following micropost published in Twitter, "#worstfeeling buying a fresh laptop..then ur screen blowz out :((", may be normalised to the following text (example taken from Kaufmann and Jugal [2010]): "worst feeling is buying a fresh laptop.. then your screen blowz out."
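The sketch below illustrates the two normalisation steps just described on the example micropost; the regular expressions and the toy SMS lexicon are ours and are not the resources used by the techniques of this thesis (in particular, the segmentation of hashtags such as #worstfeeling into separate words is not shown).

```python
import re

ABBREVIATIONS = {"ur": "your", "u": "you", "pls": "please"}  # toy SMS lexicon

def normalise(micropost):
    text = re.sub(r"#(\w+)", r"\1", micropost)    # strip the hashtag marker
    text = re.sub(r"@\w+", "", text)              # drop user mentions
    text = re.sub(r"https?://\S+", "", text)      # drop URLs
    text = re.sub(r"[:;]-?[()DP]+", "", text)     # drop simple emoticons
    # expand abbreviations found in the lexicon
    words = [ABBREVIATIONS.get(w.lower(), w) for w in text.split()]
    return " ".join(words)

print(normalise("#worstfeeling buying a fresh laptop..then ur screen blowz out :(("))
```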
2.6.3 Sentiment Analysis

According to Pang and Lee [2008], the analysis of emotions, opinions and appraisal regarding commercial companies gained momentum from 2001, following slightly different perspectives and, consequently, using terminological variations: sentiment analysis, opinion mining, brand monitoring, buzz monitoring, online anthropology, market influence analytics, conversation mining, online consumer intelligence, or user-generated content analysis are some of the terms used. These terminological divergences reflect differences in the connotation that each research group wants to project in their work, as well as the different uses given in the different epistemological communities.

In this thesis, we have adopted a term satisfying the psychological, the linguistic and the computational projections: sentiment analysis, where sentiment is conceptualised as emotion in Clore et al. [1987] (a detectable human reaction, i.e. traceable, identifiable and with a particular valence). Undetermined cognitive states, with no specific sign either positive or negative, like surprise or boredom, and bodily states, such as sleepiness, are excluded from the study. We also leave out the analysis of mood, because we agree with previous work by Thayer [1989] and Ekman [1994] in the sense that mood is a relatively persistent and often subtle emotional state, which is different from emotion, as mood is less intense and variable, less likely to be related to a particular event, and thus less likely to be readily identifiable. Although we will mainly use the term sentiment, sometimes emotion will be employed, both terms matching the definition just stated.

Pang and Lee [2008] and Liu [2010] have made comprehensive surveys describing the different approaches followed in sentiment analysis research. They have reviewed and discussed a wide collection of related works. In general, determining which sentiment is conveyed in a text is seen as a classification problem, which can be addressed with machine-learning techniques (supervised or unsupervised) [Mullen and Collier, 2004], rule-based systems [Chetviorkin et al., 2011; Ding and Liu, 2007], or combinations of them [Prabowo and Thelwall, 2009; Rentoumi et al., 2010]. Machine-learning classifiers have been fed with different features extracted from the text, like the simple presence of words (or n-grams in general) in the message, part-of-speech annotations or TF-IDF (Term Frequency – Inverse Document Frequency) measures. Rule-based systems have been applied both on plain texts and on part-of-speech annotated texts. Many of these systems rely on sentiment lexicons, where each lexical unit is associated with a sentiment category and, sometimes, also with a score specifying the degree of association. These lexical units can be extracted automatically (e.g. from other dictionaries) or, more uncommonly, manually. The works by Hatzivassiloglou and McKeown [1997] and Turney [2002] are examples of the first approach. An instance of the second one is Taboada et al. [2011], whose sentiment dictionaries were created manually to produce a system for measuring the semantic orientation of texts. Some publicly available lexicons for English are SentiWordnet [Esuli and Sebastiani, 2006], the MPQA (Multi-Perspective Question Answering) Subjectivity Lexicon [Wiebe et al., 2005], and the Harvard General Inquirer [Stone et al., 1966]. A multilingual perspective is being addressed by the Eurosentiment project [Buitelaar et al., 2013], whose main goal is to provide a shared language resource pool for fostering sentiment analysis. However, studies on languages different from English are still scarce. For Spanish, we can mention Brooke et al. [2009], who adapted the lexicon-based sentiment analysis system described in Taboada et al. [2011] by automatically translating the core lexicons and adapting other resources; Sidorov et al. [2013], who presented an analysis of various parameter settings for the most popular machine-learning classifiers; and Vilares et al. [2013], who used the syntactic structure of the text to deal with some linguistic constructions (e.g. negation). All in all, most of the research in sentiment analysis focuses on polarity classification.
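A minimal sketch of the lexicon-based polarity approach discussed above is shown below: each lexical unit carries a score and the message score is the sum over its tokens, with a naive treatment of negation. The tiny lexicon and the negator list are illustrative; real systems rely on resources such as SentiWordnet or the MPQA Subjectivity Lexicon.

```python
LEXICON = {"love": 2.0, "great": 1.5, "nice": 1.0, "hate": -2.0, "awful": -1.5}
NEGATORS = {"not", "never", "no"}

def polarity(tokens):
    score, negate = 0.0, False
    for token in tokens:
        if token in NEGATORS:
            negate = True            # flip the polarity of the next sentiment word
            continue
        value = LEXICON.get(token, 0.0)
        score += -value if negate else value
        negate = False
    return score

print(polarity("i do not love this awful phone".split()))  # -3.5
```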
Some examples of projects that go beyond polarity can be found in Strapparava and Mihalcea [2007], which summarises the evaluation of sentiment analysis systems carried out for the SemEval 2007 task on "Affective Text". The data consisted of news headlines extracted from news websites and/or newspapers, and they were annotated according to their valence (i.e. polarity) and/or six emotions (anger, disgust, fear, joy, sadness, and surprise) by different evaluators. Three systems participated in the annotation of the six emotions: SWAT [Katz et al., 2007], UA [Kozareva et al., 2007] and UPAR7 [Chaumartin, 2007], and only the last one followed a linguistic approach. None of them outperformed the others for all emotions. The organisers concluded that the gap between the results obtained by the systems and the upper bound represented by the annotator agreement suggested that there was room for future improvements.

2.6.4 Identification of Wishes

The first attempt to automatically classify sentences containing wishes was performed by Goldberg et al. [2009]. The authors reported that, after a manual annotation of a corpus of wishful texts, a number of linguistic patterns related to the expression of wishes were identified. These patterns were used to automatically extract the sentences that contained wishes. The precision reported by Goldberg et al. [2009] was 80%, obtained by combining these linguistic patterns with the most frequent words, for user-generated texts related to the area of politics. When applying the same method to product reviews, precision falls to 56%. More recent works in this area are those carried out by Wu and He [2011] and Ramanand et al. [2010]. In these studies the authors investigate methods to automatically identify different types of wishes (specifically the wish to suggest and the wish to purchase) and find linguistic patterns to extract them. Ramanand et al. [2010] also used linguistic patterns to discover two specific types of wishes, as mentioned before: sentences that make suggestions about existing products, and sentences that indicate purchasing interest. Note that the wish types of Ramanand et al. [2010] are similar to the evaluation and purchase stages of the Consumer Decision Journey we address in this thesis. Ramanand et al. [2010] reported precision and recall of 62% and 48.5%, respectively, for suggestions, and of 86.7% and 57.8% for purchase.

2.6.5 Detection of Place of Residence

The identification of the geographical origin of social media users has been tackled in the past by several research works. Mislove et al. [2011] estimate the geographical location of Twitter users by exploiting the self-reported location field in the user profile. Content-analysis approaches are appropriate when the user location is not self-declared in the user profile. Cheng et al. [2010] propose to obtain the user location based on content analysis. The authors use a generative probabilistic model that relates terms with geographic focuses on a map, placing 51% of Twitter users within 100 miles of their actual location. Backstrom et al. [2008] also described a probabilistic model. Chang et al. [2012] follow a similar approach, consisting of estimating the city distribution of the use of each word. In addition, Rao et al. [2010] describe a method for obtaining the regional origin of users from content analysis, testing different models based on Support Vector Machines (SVM) [Cortes and Vapnik, 1995] and achieving 71% accuracy when applying a model of socio-linguistic features.

2.6.6 Detection of Gender

With respect to gender identification, Mislove et al. [2011] use the user name for identifying his/her gender, achieving a coverage (i.e. proportion of users classified) of 64.2%. Burger et al.
[2011] propose to use more metadata and content features for training an automatic classifier. Using only the full name of the users, an accuracy of 0.89 is reached. An accuracy of 0.92 is achieved by using the descriptions of the users, their screen names and the text of the tweets published by them. Rao et al. [2010] authored another relevant work regarding gender identification. In this case the proposed method, based on SVM, tries to distinguish the author gender exclusively from the content and style of their writing. This solution needs an annotated seed corpus, with authors classified as male or female, to create the model used by the SVM classifier. In this case the accuracy of the best model is 0.72, lower than when considering the full name of the author.

2.6.7 Conclusions

Lemmatisation and part-of-speech tagging tools offer text processing and language annotation facilities to NLP application developers, lowering the cost of building those applications.

Social media user-generated content has particular characteristics (informality, brevity, frequent grammar errors and misspellings, heavy use of abbreviations, acronyms and emoticons, etc.). Text mining is based on the use of tools that cannot handle this broad range of variations in a language. Therefore, the task of linguistic normalisation is a necessary step before performing NLP activities like part-of-speech tagging.

Open Research Problem 7. Regarding sentiment analysis, while polarity detection has been addressed for many languages, including English and Spanish, and there are techniques for detecting emotions beyond polarity classification for English, there are no existing approaches for identifying emotions for the Spanish language.

The work we present in this thesis offers a more in-depth analysis of user-generated content than sentiment analysis. In our work, we identify critical information about consumer behaviour: we provide information about how customers are distributed along the four stages of the Consumer Decision Journey and about the nature of their comments in terms of categories of the Marketing Mix. The automatic identification of wishful sentences is the area where we have found more similarities with our work, both in terms of objectives and technologies used. To the best of our knowledge, there is no previous work that addresses these tasks. Nevertheless, the identification of wishful sentences offers some similarities that allow for a basic comparison.

Author and content metadata is not enough for capturing socio-demographic attributes like gender and place of residence. As an example, not all social media channels qualify their users with either gender or geographical location. Some channels, such as Twitter, allow their authors to specify their geographical location via a free text field. However, this text field is often left empty, filled with ambiguous information (e.g. Paris, France vs. Paris, Texas), or filled with other data that is useless for obtaining real geographical information (e.g. "Neverland").

Open Research Problem 8. The existing techniques for identifying the place of residence of social media users do not combine different metadata that may improve their accuracy. Among the metadata that can be used for this purpose are the descriptions included in users' profiles, the friendship networks, and the locations found in the content shared and produced by them.
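As a purely illustrative sketch of the kind of metadata combination that Open Research Problem 8 points to (the voting weights and the place names are hypothetical and do not correspond to the technique contributed later in this thesis), the different signals could be combined as follows:

```python
from collections import Counter

def estimate_residence(profile_location, content_locations, friend_locations):
    votes = Counter()
    if profile_location:
        votes[profile_location] += 3       # self-declared profile field, if usable
    for place in content_locations:
        votes[place] += 1                  # geographical entities found in posts
    for place in friend_locations:
        votes[place] += 2                  # homophily in the friendship network
    return votes.most_common(1)[0][0] if votes else None

print(estimate_residence(None, ["Madrid", "Paris"], ["Madrid", "Madrid", "Seville"]))
# Madrid
```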
Open Research Problem 9. The existing techniques for identifying the gender of social media users achieve good results of coverage and accuracy by using features extracted from metadata about users, as well as from the content published by them in the form of character n-grams. However, none of them takes advantage of the linguistic information that can be extracted from the content, such as gender concord (a.k.a. agreement). This may improve the proportion of users with a gender identified when it is not possible to recognise it from the user's profile metadata.

2.7 Open Research Problems

We have identified the following open research problems in the State of the Art that are addressed in this thesis.

1. There is a lack of data models for modelling the information that can be extracted from social media for the marketing domain.

2. There is a lack of a characterisation of social media according to linguistic features of the textual contents published on them.

3. The technique for uniquely identifying users in the Web based on the fingerprint of their navigation devices fails when such fingerprint evolves over time.

4. There is a lack of techniques for classifying consumer opinions according to multiple socio-demographic and psychographic attributes commonly used in the field of marketing for consumer segmentation.

5. There are no techniques for the classification of electronic word-of-mouth according to the Consumer Decision Journey framework.

6. There are no techniques for identifying Marketing Mix attributes in consumer opinions.

7. There are no techniques for detecting emotions in Spanish that go beyond polarity detection.

8. The existing techniques for identifying the place of residence of social media users do not take advantage of combining useful metadata that may improve their accuracy.

9. The existing techniques for identifying the gender of social media users do not take advantage of the linguistic information that can be extracted from the content, such as gender concord.

Chapter 3

APPROACH

In this chapter we describe the objectives pursued by this thesis together with its main contributions. We also present the hypotheses along with the restrictions and assumptions upon which our research relies.

3.1 Objectives

The goal of this thesis is to provide techniques for extracting consumer segmentations from the content generated by consumers in social media, their profile metadata, and their activities when navigating social media websites. According to the overall objective and to the open research problems identified in the State of the Art (see Chapter 2), we have defined the specific objectives of this thesis, which are described next.

O1. To provide a normalised schema for structuring the information published in social media that can be used for marketing purposes.

As described in Open Research Problem 1, there are no data models for representing information captured from social media that integrate marketing-specific classifications and KPIs obtained from the analysis of the content generated by consumers and their social network profiles, as well as from the activity produced by them in social media.

The data model described in this thesis will allow integrating, using a single format, data from social media as well as the data inferred by applying the analysis techniques presented in this thesis. In addition, the model will unify the semantics of the information extracted from heterogeneous sites, by linking social media instances (e.g.
posts, users, topics) regardless of their specific publication channels.

O2. To characterise the different social media types from the point of view of the morphosyntactic characteristics of their textual contents.

As shown by Open Research Problem 2, there is no characterisation of the different kinds of social media with respect to the linguistic characteristics of the content published on these media.

O3. To provide a fingerprint-based technique for identifying the activity of consumers in different websites that is able to detect changes in the device fingerprint.

As shown by Open Research Problem 3, the existing techniques for counting unique visitors are losing effectiveness because of privacy restrictions and of new devices for navigating the Web. The fingerprinting technique deals with such restrictions and devices, but is quite sensitive to changes in the attributes of the web browser, which leads to counting unique visitors imprecisely.

O4. To provide a collection of automatic techniques for extracting consumer segmentations according to their demographic and psychographic traits, from the analysis of the content generated by them in social media.

As reflected by Open Research Problem 4, there are no techniques for obtaining many of the demographic and psychographic attributes used in marketing from which to obtain KPIs beyond the polarity of opinion and the volume of publications. In this work we propose to automate the identification of a collection of socio-demographic and psychographic attributes from the content generated by consumers, by providing a set of individual techniques for capturing each of these attributes. We aim for an analytic technology that is able to perform a fine-grained analysis and that provides information about consumer behaviour. The automation of the activities oriented to capture these attributes from social media is unavoidable in order to drastically reduce the analysis time and the effort required to process the large amount of data available. Specifically, this objective is limited to the following sub-objectives:

O4.1. To provide techniques for the classification of consumer opinions produced in social media according to the Consumer Decision Journey framework.

As shown by Open Research Problem 5, there are no techniques that address the classification of consumer opinions according to the Consumer Decision Journey framework. Our objective in this work is to build a classifier for English and Spanish to assign e-WOM (electronic word-of-mouth) short texts to one single phase of the so-called Consumer Decision Journey (see Section 2.5.1). Such a textual classification on the different stages of the purchase process places customers in the exact moment of their purchase journey.

O4.2. To provide techniques for the classification of consumer opinions produced in social media according to the Marketing Mix framework.

As shown by Open Research Problem 6, there are no techniques that address the classification of consumer opinions according to the Marketing Mix framework. Our objective in this work is to build a classifier for English and Spanish to assign comments published by consumers about brands to the Marketing Mix elements (see Section 2.5.2) expressed in a text. The classification of texts extracted from different social media channels in terms of their belonging to one or more Marketing Mix elements gives us information about which marketing-related issues customers are talking about.
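As a toy illustration of the pattern-based intuition behind objective O4.1 (the stage names and the patterns below are invented for the example and are not the rule set developed in this thesis):

```python
import re

STAGE_PATTERNS = {
    "awareness":  [r"\bjust (saw|heard about)\b", r"\bnew\b.*\bad\b"],
    "evaluation": [r"\bshould i (buy|get)\b", r"\bwhich (one )?is better\b"],
    "purchase":   [r"\bjust (bought|ordered)\b", r"\bgoing to buy\b"],
    "post-sale":  [r"\bi (love|hate) my\b", r"\breturned\b"],
}

def classify_stage(text):
    text = text.lower()
    for stage, patterns in STAGE_PATTERNS.items():
        if any(re.search(p, text) for p in patterns):
            return stage
    return None

print(classify_stage("Should I buy the new Golf or keep my old car?"))  # evaluation
```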
O4.3. To provide a technique for identifying emotions expressed by consumers in social media for Spanish.

As shown by Open Research Problem 7, there are no techniques that address the identification of emotions for the Spanish language that go beyond polarity detection (i.e. automatically discovering pleasure or displeasure in texts). Specifically, this thesis addresses the identification of emotions according to the eight categories shown in Table 2.4 (satisfaction, dissatisfaction, trust, fear, happiness, sadness, love, and hate), overcoming the limitations of current sentiment analysis approaches, which analyse only the polarity of the sentiments expressed in user messages written in Spanish. The classification of user-generated content according to the emotions expressed in it might be useful not only for several Business Intelligence fields such as marketing, sales, or customer service, but also for public opinion analysis, where research on people's behaviour is crucial.

O4.4. To provide a technique for recognising the place of residence of social media users that improves the accuracy of existing techniques.

As shown by Open Research Problem 8, different approaches and kinds of metadata can be used for improving the accuracy of existing techniques. Our objective is to define and validate a technique that exploits user profile descriptions, friendship networks, and geographical entity recognition within contents for detecting the place of residence of social media users.

O4.5. To provide a technique for recognising the gender of social media users that improves the coverage of the techniques based on profile metadata by exploiting the linguistic information that can be extracted from the content written in Spanish.

As shown by Open Research Problem 9, the existing techniques for gender identification do not take into account the linguistic information that can be extracted from content analysis for improving their coverage.

3.2 Contributions to the State of the Art

This thesis contributes to the State of the Art with a data model and a set of techniques that address the objectives described in the previous section. The contributions of this thesis are explained next.

C1. A normalised schema for representing the information extracted from heterogeneous social media about brands, consumers and opinions of consumers about brands, useful for the marketing domain.

This schema includes concepts and attributes for modelling the content and metadata defined explicitly in social media. In addition to these explicitly defined data, the schema provides concepts and attributes for representing the data enrichments inferred when applying the user identification technique (C3) and the consumer segmentation techniques (C4). The schema has been designed as a semantic data model defined by an ontology network reusing ontologies widely used in the Semantic Web and Linked Data fields.

C2. A descriptive characterisation of social media types from the point of view of the morphosyntactic characteristics of the content published on them.

We have processed and characterised corpora of user-generated content extracted from different social media sources. Specifically, we have studied differences in the language used in distinct types of social media content by analysing the distribution of part-of-speech categories in such sources.
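A minimal sketch of how such a distribution could be computed for one media type is shown below; the tagged tokens are toy input, and the tagger itself is left out (the actual corpora and tools are described together with the contribution).

```python
from collections import Counter

def pos_distribution(tagged_tokens):
    """Relative frequency of part-of-speech tags in a list of (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: round(n / total, 3) for tag, n in counts.items()}

tweet_tokens = [("love", "VB"), ("it", "PRP"), ("so", "RB"), ("cool", "JJ"), ("!", ".")]
print(pos_distribution(tweet_tokens))
# {'VB': 0.2, 'PRP': 0.2, 'RB': 0.2, 'JJ': 0.2, '.': 0.2}
```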
C3. A technique for the identification of unique users from the fingerprint of the devices they use when interacting with social media, which is tolerant to changes in such fingerprint.

This thesis will contribute to the State of the Art with an algorithm, based on the fingerprinting technique defined by Eckersley [2010], which allows identifying unique visitors accurately, regardless of changes in browser attributes. For doing so, our algorithm is able to detect the evolution of the fingerprint and, therefore, to effectively group distinct fingerprints that correspond to the same user.

C4. A collection of techniques for extracting socio-demographic and psychographic profiles from social media users, applied to the marketing domain.

The socio-demographic variables considered include gender and place of residence, while the psychographic information includes purchase intention, Marketing Mix elements, and emotional perceptions about brands. Specifically, this thesis provides the following contributions to the State of the Art.

C4.1. A technique for classifying consumer opinions produced in social media according to the Consumer Decision Journey stages, for texts written in English and Spanish.

We have developed a classifier based on the identification of linguistic patterns in short texts. These linguistic patterns were then used as part of a set of rules to classify each particular text into one of the Consumer Decision Journey stages.

C4.2. A technique for classifying consumer opinions produced in social media according to the Marketing Mix framework, for texts written in English and Spanish.

We have developed a classifier based on machine-learning techniques, specifically on Decision Tree (DT) learning algorithms.

C4.3. A technique for analysing consumer opinions written in Spanish according to the emotions expressed in such opinions, which goes beyond polarity identification by identifying the following sentiment categories: satisfaction, dissatisfaction, trust, fear, happiness, sadness, love, and hate.

We have developed a technique for classifying the texts of a corpus of consumer opinions about brands according to the sentiment they express. Unlike many existing solutions that focus on polarity classification, which deal with English texts and extract documents from specific channels and a few domains, in the work presented in this thesis we are interested in an eight-sentiment classification of Spanish texts that consist of documents with different sizes and characteristics from diverse social media and product domains.

C4.4. A technique for identifying the place of residence of social media users that improves the accuracy of existing techniques.

The technique proposed exploits the metadata declared by social media users in their social network profiles, the locations included in the contents published and shared by them, and their friendship networks.

C4.5. A technique for identifying the gender of social media users that exploits the gender concord existing for the Spanish language.

The technique proposed exploits the metadata declared by social media users in their social network profiles and takes advantage of the linguistic concord existing in certain languages like Spanish for determining the gender of the users mentioned in the content produced by other users.

Figure 3.1 depicts the contributions to the State of the Art of this thesis. The contributions of this thesis can be grouped into three tiers.
• The Earned Media Knowledge Base provides the data warehouse for storing marketing-oriented structured information extracted from social media or inferred from it. The contribution C1 provides the ontology network that models such data warehouse.

• The Inference Layer provides the engine that can reason about the facts extracted from social media, producing new inferences. The contribution C3 provides a technique for identifying users uniquely from their web activity, while the contribution C4 provides a collection of techniques for segmenting consumers from the information shared and published by them in social media.

• The Social Media Characterisation tier provides observations on social media content attributes that may be considered for producing the algorithms of the Inference Layer. The contribution C2 provides a characterisation from the point of view of the morphosyntactic attributes of the content published in social media.

[Figure 3.1: Contributions to the State of the Art]

3.3 Assumptions

The models and techniques proposed in this thesis rely on the following assumptions.

Assumption 1. It is possible to structure the content published on social media (and the associated metadata) according to a single normalised data schema.

Assumption 2. The information structured according to the data model proposed, including data explicitly defined in social media and data enrichments obtained by our analysis techniques, can be used for higher-level Business Intelligence processes, like the ones presented in Section 2.4.

Assumption 3. Consumers' demographic and psychographic profiles (feelings, interests, etc.) can be obtained from social media, even if those profiles are not declared explicitly by the user, by analysing the content published and shared by such consumers, as well as other metadata, such as profile information and friendship networks.

3.4 Hypotheses

The overall research hypothesis of this work is that it is possible to extract information useful for marketing activities from the content and activity generated by consumers in social media, despite the heterogeneity of textual contents and metadata, and the disparity of access devices. The specific hypotheses are described next.

Hypothesis 1. The contents published in social media statistically present different morphosyntactic features depending on the specific kind of media where they have been published.

Hypothesis 2. The online activity generated by consumers in social media can be grouped and identified effectively through the digital fingerprint of their devices by using the technique described in this thesis, even when such fingerprint varies over time. The technique must outperform the existing approach authored by Eckersley [2010], whose accuracy, false positive rate, and coverage (i.e. percentage of browsers classified) are 0.991, 0.0086 and 65%, respectively.
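The following sketch conveys only the intuition behind Hypothesis 2: two fingerprints are attributed to the same visitor when they agree on most attributes, so that a single changed attribute (e.g. a browser upgrade) does not break the identification. The attribute names and the 0.8 threshold are hypothetical and do not reproduce Eckersley's algorithm or the technique contributed by this thesis.

```python
def similarity(fp_a, fp_b):
    keys = set(fp_a) | set(fp_b)
    matches = sum(1 for k in keys if fp_a.get(k) == fp_b.get(k))
    return matches / len(keys)

def same_visitor(fp_a, fp_b, threshold=0.8):
    return similarity(fp_a, fp_b) >= threshold

old = {"user_agent": "Firefox/35", "timezone": "UTC+1", "screen": "1920x1080",
       "fonts": "arial,verdana", "plugins": "flash"}
new = dict(old, user_agent="Firefox/36")   # a browser upgrade changes one attribute

print(same_visitor(old, new))  # True: the evolved fingerprint is still grouped
```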
Hypothesis 3. Consumers utilise different expressions along the four stages of the Consumer Decision Journey. Therefore, if we are able to identify the particular linguistic expressions used in each of the stages of the purchase process, we will be able to classify texts along the different phases and, consequently, we will be able to approximate distributions of consumers in different moments of the Consumer Decision Journey process. Although there are no existing techniques for identifying Consumer Decision Journey stages from user-generated content, the results provided by this thesis must be in line with existing approaches for the identification of wishes, with precisions that vary from 56% to 86.7% depending on the wish type.

Hypothesis 4. The vocabulary used by consumers when publishing comments about brands in social media can be used to identify the Marketing Mix attributes they are referring to. Therefore, if we are able to identify the particular lexical elements that refer to such attributes, we will be able to classify texts according to the Marketing Mix framework and, consequently, we will be able to approximate distributions of consumers that refer to the distinct Marketing Mix elements.

Hypothesis 5. Consumers utilise different expressions to express their sentiment about brands beyond their pleasure and displeasure about brand products, specifically for expressing the satisfaction, dissatisfaction, trust, fear, love, hate, happiness, and sadness sentiments. Thus, if we are able to identify the particular linguistic expressions used for each of these sentiments, we will be able to classify texts along the different emotions and, consequently, we will be able to approximate distributions of consumers according to fine-grained sentiments about brands.

Hypothesis 6. The homophily existing between the users of a social network [McPherson et al., 2001] can be used for improving the accuracy of existing techniques for identifying their place of residence (from 51% to 71%). Specifically, the friendship network of a given user can be used for estimating her/his place of residence, as most of her/his friends may share her/his location.

Hypothesis 7. The linguistic concord existing in the posts written in Spanish that explicitly mention social media users can be exploited for enhancing the coverage of the gender identification techniques that make use of the name declared by users in their profiles.

3.5 Restrictions

Restriction 1. The technique for identifying unique users from their online activity is restricted to the identification of the unique devices that they use for browsing the Web. The consolidation of multiple devices into a unique user identity (e.g. relating her smartphone and tablet fingerprints) is out of the scope of the technique proposed. Cross-device and cross-site identification can be performed by combining logged sessions with fingerprint records or third-party cookies, and does not pose a research problem.

Restriction 2. The techniques for the analysis of user-generated content presented in this thesis are restricted to textual content. Therefore, the analysis of audio-visual content is out of the scope of this thesis.

Restriction 3. This thesis provides techniques for inferring psychographic characteristics of consumers related to their position in the Consumer Decision Journey and the Marketing Mix attributes they consider when talking about products and brands.
The mining of other psychographic characteristics, such as hobbies or interests, is out of the scope of this thesis.

Restriction 4. This thesis provides techniques for inferring socio-demographic characteristics of consumers related to their gender and place of residence. The mining of other socio-demographic characteristics used in the marketing domain, such as age or purchasing power, is out of the scope of this thesis.

Restriction 5. The technique for detecting Consumer Decision Journey stages in user-generated content is limited to the English and Spanish languages. Other languages are out of the scope of this thesis.

Restriction 6. The technique for detecting Marketing Mix elements in user-generated content is limited to the English and Spanish languages. Other languages are out of the scope of this thesis.

Restriction 7. The technique for detecting emotions in user-generated content is limited to the Spanish language. Other languages are out of the scope of this thesis.

Restriction 8. The text-mining techniques provided by this thesis have been evaluated with corpora extracted from social media consisting of posts mentioning brands of the following commercial sectors: automotive, banking, beverages, sports, telecommunications, food, retail, and utilities. The accuracy of the techniques may vary significantly when applied to posts mentioning brands belonging to other sectors.

Restriction 9. The deployment of the techniques proposed by this thesis in an industrial environment, as well as the validation of their scalability, is out of the scope of this thesis. Nevertheless, we have performed some preliminary tests regarding scalability, whose results are shown in Section 9.4.7.

Restriction 10. We have chosen Freeling for executing the lemmatisation, part-of-speech tagging and dependency parsing tasks of contribution C4, because it is customisable, extensible and robust, and offers high reliability for Spanish. The evaluation results could vary slightly if a different computational linguistics tool was used.

Restriction 11. As an exception, for contribution C2 we have used TreeTagger for Spanish due to project technology requirements at the moment in which the study was performed. Therefore, the part-of-speech distributions provided may also vary with the use of a different part-of-speech tagger.

Finally, to conclude this chapter, Figure 3.2 shows the relationships among the objectives, contributions, assumptions, hypotheses and restrictions of this thesis.

[Figure 3.2: Relationships between the objectives, contributions, assumptions, hypotheses and restrictions]

Chapter 4

RESEARCH METHODOLOGY

This chapter describes the research methodology followed for obtaining the contributions of this work. Before describing the methodology, Section 4.1 provides definitions for the terms methodology, method, technique, process, activity and task, which appear frequently in this thesis. After providing these definitions, Section 4.2 describes the research methodology, and Section 4.3 details the methods followed for obtaining the ontology and techniques provided by this thesis.

4.1 Terminology

Throughout the literature, the terms methodology, method, technique, process, activity, etc. are used interchangeably.
Therefore, for the sake of clarity, in this thesis we have adopted several IEEE (http://www.ieee.org) definitions, which are described in detail in different sources [IEEE, 1990, 1995a,b, 1997; Sommerville, 2007] and shown in Figure 4.1.

Definition 5. A methodology is a comprehensive, integrated series of techniques or methods that create a general system theory of how a class of thought-intensive work ought to be performed [IEEE, 1995a].

Definition 6. Methods are parts of methodologies. A method is a set of "orderly processes or procedures used in the engineering of a product or in performing a service" [Sommerville, 2007]. Methods are composed of processes.

[Figure 4.1: Relations between methodology, methods, techniques, processes, activities and tasks (adapted from Gómez-Pérez et al. [2004])]

Definition 7. Techniques are parts of methodologies. Techniques are "the application of accumulated technical or management skills and methods in the creation of a product or in performing a service" [IEEE, 1990]. Techniques detail methods and their components (processes, activities and tasks).

Definition 8. A process is a set of activities whose goal is the development or the evolution of software [Sommerville, 2007].

Definition 9. An activity is a defined body of work to be performed, including its required input and output information [IEEE, 1997]. Activities can be divided into zero or more tasks.

Definition 10. A task is the smallest unit of work subject to management accountability. A task is a well-defined work assignment for one or more project members. Related tasks are usually grouped to form activities [IEEE, 1995b].

4.2 Research Methodology

This research was motivated by the need of the marketing field to measure and understand the effects of earned media during advertising campaigns. Therefore, we initially defined a broad research problem: to develop techniques for acquiring marketing-oriented knowledge from the unstructured content published in social media. Thus, to refine this research problem and define the objectives and hypotheses of the thesis, we followed an iterative methodology consisting of two stages (see Figure 4.2).

In the first stage we used an exploratory approach [Kothari, 2004]. The objective of exploratory research is to define the research problem and the hypotheses to be tested. Accordingly, in the first stage we reviewed the State of the Art on approaches for knowledge acquisition from user-generated content and user activity, as well as the marketing background of our thesis. This review of the State of the Art, which was presented in Chapter 2, helped us to specify in more detailed terms the definition of the research problem and the hypotheses of our work. Therefore, we defined our research problem more precisely in terms of providing techniques for extracting consumer segmentations from the content generated by consumers in social media, their profile metadata, and their activities when navigating social media websites. The objectives, as well as the hypotheses on which we rely to propose a solution for this problem, were presented in Chapter 3.

[Figure 4.2: Iterative research methodology using exploratory and experimental approaches]
Once we had defined the research problem, we proceeded to the second stage, where we followed an experimental approach [Dodig-Crnkovic, 2002; Kothari, 2004]. Our objective in the experimental research was to propose a solution based on the hypotheses to fulfil the research objectives, and to design experiments to validate the hypotheses. In this stage we investigated existing techniques in other research fields, such as Natural Language Processing and Information Retrieval, which might help to reach the objectives. Then we adapted these techniques to the requirements defined by the particularities of our research. After this, we designed the experiments to validate the proposed solutions, using well-known evaluation metrics. Next, we carried out an abstraction exercise over the procedure that we had followed when developing the techniques, and designing and executing the experiments. The objective was to elicit commonalities in the form of data models, activities, and tasks. Thus, with these components we produced the contributions of this thesis. We performed five iterations, one per technique provided (contributions C3, C4.1, C4.2, C4.3, C4.4, and C4.5). The ontology (contribution C1) was continuously refined during the execution of each iteration. The morphosyntactic characterisation of social media contents (contribution C2) was produced at a preliminary stage of the first iteration.

4.3 Method Followed for Obtaining the Artefacts Provided by this Thesis

Extracting knowledge from social media information requires: (i) building a data warehouse from which insights can be obtained by querying it, and (ii) applying different analysis techniques for obtaining knowledge from the data warehouse, such as graph and time series analyses. The method that we have followed for obtaining the artefacts provided by this thesis is inspired by an existing framework defined by Hu and Cercone [2004] for Web mining and Business Intelligence reporting. This framework follows the data warehousing approach proposed by Kimball et al. [1998] and Kimball and Ross [2002], and provides guidelines for performing research on data extracted from the Web, including guidelines for the data warehouse construction, among other activities.

[Figure 4.3: Web mining framework (adapted from Hu and Cercone [2004])]

Figure 4.3 illustrates the data flow proposed by the framework, which involves the following phases:

1. Data Capture. This phase consists in capturing and cleansing data coming from heterogeneous web data sources.

2. Data Webhouse Construction. This phase consists in creating a database for storing the data gathered in the previous activity. To do this, the database requirements are analysed, the database schema is defined, and the data captured are transformed according to this schema.

3. Mining, OLAP. This phase consists in the execution of data mining tasks in order to derive useful knowledge from the data stored in the database created in the previous activity.

4. Pattern Evaluations and Deployment. This phase consists in the evaluation of the models obtained in the previous activity, as well as in the deployment of the validated models.
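A schematic sketch of how these four phases chain together is shown below; the function names and bodies are placeholders added here for illustration and are not part of the framework itself.

```python
def capture_data(sources):                 # 1. Data Capture: gather and cleanse
    return [post for source in sources for post in source]

def build_webhouse(records):               # 2. Data Webhouse Construction: store
    return {"posts": records}

def mine(webhouse):                        # 3. Mining, OLAP: derive models and reports
    return {"post_count": len(webhouse["posts"])}

def evaluate_and_deploy(model):            # 4. Pattern Evaluations and Deployment
    return model["post_count"] > 0

posts = capture_data([["post 1", "post 2"], ["post 3"]])
print(evaluate_and_deploy(mine(build_webhouse(posts))))  # True
```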
We follow two methods for dealing with the data mining phases defined by Hu and Cercone [2004]:

• For addressing the Data Webhouse Construction phase we follow the methodology proposed by Suárez-Figueroa et al. [2012] for constructing ontology networks. Section 4.3.1 describes the method followed for constructing the social media data model that will be described in Chapter 5.

• For addressing the other phases (Data Capture, Mining, and Evaluation and Deployment) we follow the CRISP-DM reference process model [Shearer, 2000], which is a framework that describes a set of generic activities and tasks that any data mining process may implement. Section 4.3.2 describes the method followed by the data-mining techniques proposed by this thesis that will be described in chapters 7 and 8.

4.3.1 Method Followed for Ontology Engineering

We have followed the NeOn methodology for building ontology networks [Suárez-Figueroa et al., 2012] for engineering the social media data model provided by this thesis. Such methodology: (i) proposes the processes and activities involved in the construction of ontology networks, (ii) defines two ontology development life cycle models, (iii) identifies and describes a set of scenarios for building ontology networks, and (iv) provides a set of methodological guidelines for performing some of the processes and activities proposed.

Specifically, we have implemented the Reusing Ontological Resources scenario, as we have reused existing ontologies in the construction of our data model. The sequence of activities in this scenario is the following:

1. Ontology Search. This activity consists in finding candidate ontologies or ontology modules to be reused. We have searched for the candidate ontological resources that satisfy the requirements using search services for the Web.

2. Ontology Assessment. This activity consists in checking an ontology against the user's requirements, such as usability, usefulness, abstraction, and quality. After executing this activity we obtained a list of candidate ontologies for being reused, which has been described in Section 2.1.

3. Ontology Comparison. This activity consists in finding differences between two or more ontologies or between two or more ontology modules.

4. Ontology Selection. This activity consists in choosing the most suitable ontologies or ontology modules among those available in an ontology repository or library, for a concrete domain of interest and associated tasks. The result of this activity has been a selection of ontologies for being reused, which are listed in Table 5.1 of Chapter 5.

5. Ontology Integration. This activity consists in integrating one ontology into another ontology. The ontologies selected have been imported into the ontology network depicted in Figure 5.1 of Chapter 5.

Apart from the activities defined by this scenario, we have implemented the following activities (definitions literally taken from Suárez-Figueroa et al. [2012]):

Ontology Annotation. It refers to the activity of enriching the ontology with additional information, e.g. metadata or comments. We have commented each new ontology element.

Ontology Conceptualisation. It refers to the activity of organising and structuring the information (data, knowledge, etc.), obtained during the acquisition process, into meaningful models at the knowledge level and according to the ontology requirements specification document. This activity is independent of the way in which the ontology implementation will be carried out.
Prior to the Ontology Reuse Process, we identified the concepts, attributes and relations that the ontology network must cover.

Ontology Documentation. It refers to the collection of documents and explanatory comments generated during the entire ontology building process. This thesis includes the documentation of the developed ontology network.

Ontology Elicitation. It is a knowledge acquisition activity in which conceptual structures (i.e. T-Box) and their instances (i.e. A-Box) are acquired from domain experts. In our case, we obtained conceptual structures and types from the marketing frameworks described in the State of the Art (see sections 2.5.1, 2.5.2 and 2.5.3).

Ontology Enrichment. It refers to the activity of extending an ontology with new conceptual structures (e.g. concepts, roles and axioms). After performing the Ontology Integration activity, there were missing ontology elements for modelling some concepts, attributes and properties identified during the conceptualisation phase. Therefore, we enriched the ontology network with our own ontology elements, which have been grouped under a specific namespace.

Ontology Environment Study. It refers to the activity of analysing the environment in which the ontology is going to be developed. Such environment has been described in Section 2.5.

Ontology Implementation. It refers to the activity of generating computable models according to the syntax of a formal representation language, e.g. RDFS (http://www.w3.org/TR/rdf-schema) and OWL (http://www.w3.org/TR/owl2-primer). Our ontology has been implemented using OWL.

Ontology Modularisation. It refers to the activity of identifying one or more modules in an ontology with the purpose of supporting reuse or maintenance. We have structured our ontology into seven modules that are described in Chapter 5.

Ontology Summarisation. It refers to the activity of providing an abstract or summary of the ontology content. We have summarised the ontology network using a UML [OMG, 2011] representation, which has been included in Chapter 5.

Regarding the ontology development life cycle, we have selected an iterative-incremental ontology network life cycle model, as requirements were changing during the ontology development.

4.3.2 Method Followed for the Data Mining Techniques

This research is framed within the reference model CRISP-DM (Cross Industry Standard Process for Data Mining), applied to the extraction of information from social media. Therefore, we have instantiated the activities and tasks within this process for performing our research. Figure 4.4 shows the activities involved in the CRISP-DM process. Next, each of the activities is described, as well as the tasks that have been instantiated by the contributions of this thesis.

[Figure 4.4: The CRISP-DM reference model (adapted from Shearer [2000])]

4.3.2.1 Business Understanding

This initial activity focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. The result of this activity has been included in Chapter 2, where the State of the Art and, specifically, the marketing frameworks have been described, as well as in Chapter 3, where the objectives, contributions, assumptions, hypotheses and restrictions of this research have been detailed.
4.3.2.2 Data Understanding This activity starts with initial data collection and proceeds with tasks that enable data analysts to become familiar with the data, identify data quality problems, discover first insights into the data, and/or detect interesting subsets to form hypotheses regarding hidden information. The tasks involved in this activity are the following: Collect Initial Data. The goal of this task is to acquire the data used for learning purposes. Describe Data. This task consists in examining the “gross” or “surface” properties of the acquired data, describing the format of the data and the quantity of the data, as any other relevant features that have been discovered. Explore Data. This task addresses data mining questions using querying, visualisation, and reporting techniques, obtaining distributions of key attributes, relations between pairs of attributes and other simple statistical analyses. Verify Data Quality. This task examines the quality of the data, addressing questions such as data completeness. The data mining techniques proposed by this thesis, which are described in chapters 7 and 8, implement this activity. 80 4.3.2.3 Data Preparation This activity covers all tasks needed to construct the final dataset —data that will be fed into the modelling tools—, from the initial raw data. The tasks involved in this activity are the following: Select Data. The goal of this task is to decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types. Clean Data. The goal of this task is to raise the data quality to the level required by the selected analysis techniques. This task may involve the selection of clean subsets of the data, the insertion of suitable defaults, or more ambitious techniques such as the estimation of missing data by modelling. Construct Data. This task performs data preparation operations such as the production of transformed values for existing attributes. This activity is also implemented by the data mining techniques proposed by this thesis, which are described in chapters 7 and 8. 4.3.2.4 Modelling This activity applies one or more techniques for obtaining a final model. When the performance of the model obtained depends on parameters, such parameters are calibrated to optimal values. The tasks involved in this activity are the following: Select Modelling Technique. The goal of this task is to select the actual modelling technique to be used (e.g. decision-tree building, rule-set engineering). Generate Test Design. The goal of this task is to generate a procedure to test the model for quality and validity. This involves choosing evaluation metrics like precision or recall, and separating the dataset into training and test sets. Build Model. The goal of this task is to create the model. This typically involves running a modelling tool on the prepared dataset and performing 81 human supervision on the model, depending on the modelling technique chosen. This activity is also implemented by the data mining techniques proposed by this thesis, which are described in chapters 7 and 8. 4.3.2.5 Evaluation This activity consists in evaluating the model obtained in order to assess that it has a high quality from a data analysis perspective as well as to be certain the model properly achieves the business objectives. This activity is also implemented by the data mining techniques proposed by this thesis, which are described in chapters 7 and 8. 
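To make the Generate Test Design, Build Model and Evaluation tasks more concrete, the following minimal sketch (written in Python with the scikit-learn library, which is not tooling prescribed by CRISP-DM; the feature matrix X and the labels y are hypothetical placeholders rather than data used in this thesis) separates a prepared dataset into training and test sets and reports the precision and recall metrics mentioned above.

# Illustrative sketch of a generic test design: hold-out split plus precision/recall.
# X (feature vectors) and y (class labels) are hypothetical placeholders.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

def build_and_evaluate(X, y):
    # Generate Test Design: keep 20% of the examples as an independent test set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Build Model: any modelling technique could be plugged in here.
    model = DecisionTreeClassifier().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Evaluation: quality metrics computed on the held-out test set.
    return {"precision": precision_score(y_test, y_pred, average="macro"),
            "recall": recall_score(y_test, y_pred, average="macro")}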
4.3.2.6 Deployment

This activity integrates the model obtained into the application that will make use of it. As stated by Restriction 9, the deployment of the data mining techniques proposed by this thesis is part of the future work. However, we have performed some preliminary tests regarding scalability whose results are shown in Section 9.4.7.

Chapter 5

SOCIAL MEDIA ONTOLOGY FOR CONSUMER ANALYTICS

This chapter describes the ontology for representing the information extracted from social media, as well as the knowledge about consumers that can be inferred from such information by applying the analysis techniques presented in this thesis, which are described in the following chapters.

The social media ontology has been defined as an ontology network, called Social Graph Ontology (SGO). The Social Graph Ontology OWL implementation has not been made public due to the exploitation rights defined by the Social TV Project (TSI-100600-2013-53). Such ontology reuses existing semantic vocabularies, which have already been described in Section 2.1. The reused vocabularies are enumerated in Table 5.1. Figure 5.1 shows the import relations between the Social Graph Ontology and the rest of vocabularies (non-dashed lines). In addition, the dashed lines represent the existing import relations between the vocabularies reused. The colours associated to each vocabulary are used to denote the namespaces to which the classes and properties of the ontology network belong.

Vocabulary            Prefix    Namespace
SIOC                  sioc      http://rdfs.org/sioc/ns#
FOAF                  foaf      http://xmlns.com/foaf/0.1/
schema.org            schema    http://schema.org/
Dublin Core           dcterms   http://purl.org/dc/terms/
SKOS                  skos      http://www.w3.org/2004/02/skos/core#
ISOcat                isocat    http://www.isocat.org/ns/dcr.rdf#
Marl                  marl      http://purl.org/marl/ns#
Onyx                  onyx      http://www.gsi.dit.upm.es/ontologies/onyx/ns#
WGS84                 geo       http://www.w3.org/2003/01/geo/wgs84_pos#
Time Zone Ontology    tzont     http://www.w3.org/2006/timezone#
Named Graphs          rdfg      http://www.w3.org/2004/03/trix/rdfg-1/

Table 5.1: Vocabularies selected for defining the Social Graph Ontology

Figure 5.1: Ontology network

5.1 Ontology Modules

The ontology is divided into seven ontology modules that are shown in Figure 5.2. The arrows represent usages of ontology elements contained in the modules pointed by such arrows. The modules of the Social Graph Ontology are the following:

1. The Core Ontology Module defines the main components of the ontology (see Section 5.3).

2. The Publication Channels Module defines the ontology elements in charge of representing information related to the content publication media (see Section 5.4).

3. The Contents Module defines the ontology elements used for representing information related to the contents published in social media (see Section 5.5).

4. The Users Module defines the ontology elements used for representing information related to social media users (see Section 5.6).

5. The Opinions Module defines the ontology elements used for representing information related to opinions expressed within the contents (see Section 5.7).

6. The Topics Module defines the ontology elements used for representing information related to the topics that the contents are about (see Section 5.8).

7. Finally, the Locations Module defines the ontology elements used for representing information related to the geographical locations associated to users and contents (see Section 5.9).
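As a minimal illustration of how the vocabularies of Table 5.1 are combined, the following sketch (written in Python with the rdflib library, which is not the tool used for implementing the ontology; the sgo namespace shown is a placeholder, since the SGO implementation has not been made public) binds the reused prefixes in an RDF graph so that instance data can later be serialised in Turtle.

# Sketch: declaring the vocabularies enumerated in Table 5.1 with rdflib.
from rdflib import Graph, Namespace

PREFIXES = {
    "sioc": Namespace("http://rdfs.org/sioc/ns#"),
    "foaf": Namespace("http://xmlns.com/foaf/0.1/"),
    "schema": Namespace("http://schema.org/"),
    "dcterms": Namespace("http://purl.org/dc/terms/"),
    "skos": Namespace("http://www.w3.org/2004/02/skos/core#"),
    "isocat": Namespace("http://www.isocat.org/ns/dcr.rdf#"),
    "marl": Namespace("http://purl.org/marl/ns#"),
    "onyx": Namespace("http://www.gsi.dit.upm.es/ontologies/onyx/ns#"),
    "geo": Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#"),
    "tzont": Namespace("http://www.w3.org/2006/timezone#"),
    "rdfg": Namespace("http://www.w3.org/2004/03/trix/rdfg-1/"),
    "sgo": Namespace("http://example.org/sgo#"),  # placeholder namespace
}

g = Graph()
for prefix, namespace in PREFIXES.items():
    g.bind(prefix, namespace)   # later serialisations will use these prefixes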
Before explaining the SGO modules in detail, we briefly summarise next the notation used for describing the ontology.

Figure 5.2: Social Graph Ontology Modules

5.2 Notation Used

We use UML [OMG, 2011] class diagrams for representing the elements contained within the ontology modules. Figure 5.3 shows an example class. The title of the box represents the class name (Site), prefixed by its namespace abbreviation (sioc). Within the box we find the data type properties of the class (e.g. rank) prefixed by their namespace abbreviation (e.g. sgo), and followed by their XML Schema data type (e.g. decimal; see http://www.w3.org/TR/xmlschema-2). The class may include a URL [Berners-Lee, 1994] property if its identifier is dereferenceable. The criterion chosen for deciding if a class can be identified by a URL is the existence of a resource in the Web pointed by the URL. Optionally, a class with a non-dereferenceable URI [Berners-Lee et al., 2005] may include a URI property with a clue on how it can be constructed from some of its properties in order to guarantee identifiers' uniqueness.

Figure 5.3: Class Example

Figure 5.4 shows an example object property, represented by a labelled arrow with the name of the property (e.g. hasActivity) prefixed by its namespace abbreviation (sgo). The direction of the arrow is used for notating the domain and range of the property. The range is represented as the class pointed by the arrow (e.g. sgo:Activity), while the domain is the other class (e.g. foaf:Agent). Properties are annotated with their domain and range cardinalities.

Figure 5.4: Object Property Example

Some object properties may have inverse object properties. We notate these cases with a bidirectional arrow annotated with the name of the property and its inverse, as shown in Figure 5.5.

Figure 5.5: Inverse Object Properties Example

Class inheritance is represented with UML notation as shown in Figure 5.6, where the classes foaf:Organisation and foaf:Person are subclasses of the class foaf:Agent. In an analogous way, property inheritance is represented as shown in Figure 5.7. In the example, the properties sioc:reply_of, sioc:has_reply, sioc:copies and sioc:shares are subproperties of the property sioc:related_to.

Figure 5.6: Class Inheritance Example

Figure 5.7: Property Inheritance Example

Finally, instances are represented as shown in Figure 5.8 with underscored names for instances and dashed lines for the instantiation relationship. In the example, the resources marl:Positive, marl:Neutral and marl:Negative are instances of the class marl:Polarity.

Figure 5.8: Instances Example

5.3 Core Ontology Module

Figure 5.9 shows a UML representation of the core ontology module. The classes defined by this module are the following:

• The class sioc:UserAccount represents user accounts defined for specific social media. The properties defined for the sioc:UserAccount class are shown in Table 5.2.

• The class sioc:Post represents specific contents published in publication channels by social media users. Such contents can take the form of text, video, image, etc. The properties defined for this class are shown in Tables 5.3 and 5.4.

• The class sioc:Forum represents publication channels into which users publish contents. The properties defined for this class are shown in Table 5.5.

• The class marl:Opinion represents opinions extracted from posts. The properties defined for this class are shown in Table 5.6.

• The class skos:Concept represents the subjects that contents are about, around which online communities are organised, or in which users are interested. It also represents the specific entities (e.g. brands) that are opinionated by users. As an indeterminate number of types of subjects and entities may be opinionated by social media users, we have chosen not to create specific concept subclasses, but to annotate such concepts with standard semantic and syntactic categories. Further details are provided in Section 5.8. The properties defined for this class are shown in Table 5.7.

• The class sioc:Community represents online communities of users that share interest in specific topics. The object properties defined for this class are shown in Table 5.8.

• Finally, the class rdfg:Graph represents named graphs that correspond to specific social graph instances. Such instances can be used for grouping specific data analysis projects. The data properties defined for this class are shown in Table 5.9.
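The following sketch shows how a minimal fragment of the core module could be instantiated (Python with the rdflib library; the account, channel and content values are invented for illustration, and the sgo namespace is a placeholder, since the SGO implementation is not public). The identifiers follow the construction rules given in the tables below: the user account URI concatenates the site and the account name, the forum URI concatenates the site and the channel type, and the post is identified by the URL of the resource it annotates.

# Sketch: invented instances of sioc:UserAccount, sioc:Forum and sioc:Post.
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD

SIOC = Namespace("http://rdfs.org/sioc/ns#")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
DCTERMS = Namespace("http://purl.org/dc/terms/")
SGO = Namespace("http://example.org/sgo#")  # placeholder namespace

g = Graph()

# User account identified by concat(site, accountName).
account = URIRef("http://twitter.com/jdoe")
g.add((account, RDF.type, SIOC.UserAccount))
g.add((account, FOAF.nick, Literal("jdoe")))
g.add((account, SGO.numFollowers, Literal(120, datatype=XSD.nonNegativeInteger)))

# Publication channel identified by concat(site, type).
forum = URIRef("http://twitter.com/microblog")
g.add((forum, RDF.type, SIOC.Forum))
g.add((forum, DCTERMS.type, Literal("microblog")))

# Post identified by the URL of the web resource it annotates.
post = URIRef("http://twitter.com/jdoe/status/1")
g.add((post, RDF.type, SIOC.Post))
g.add((post, SIOC.has_creator, account))
g.add((post, SIOC.has_container, forum))
g.add((post, SIOC.content, Literal("Me encanta mi nuevo móvil", lang="es")))
g.add((post, DCTERMS.created, Literal("2014-05-01T10:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))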
90 cd Social Graph Ontology rdfg:Graph URI rdfs:label dcterms:description dcterms:isPartOf skos:Concept * * * * * sioc:topic dcterms:isPartOf sioc:Forum sioc:has_subscriber URI: concat(site, type) dcterms:type * * sioc:has_container * * * * sioc:Post * * * URL sioc:content (language tagged) dcterms:identifier * dcterms:created dcterms:title dcterms:dateCopyrighted dcterms:medium * * * sioc:num_views * sioc:num_replies sioc:creator_of geo:lat * sioc:has_creator geo:long schema:wordCount dcterms:references schema:contentRating dcterms:contributor * schema:articleBody * schema:isFamilyFriendly sgo:numLikes: xsd:nonNegativeInteger sgo:numShares: xsd:nonNegativeInteger * marl:hasOpinion * marl:Opinion sgo:impact: xsd:decimal marl:extractedFrom URI: hash(post, text) sgo:reach: xsd:nonNegativeInteger marl:optinionText sgo:engagement: xsd:decimal marl:polarityValue sgo:relevance: xsd:decimal sgo:isPromotion: xsd:boolean * * sioc:topic * sioc:related_to * URI: concat(site, accountName) foaf:nick foaf:accountName * dcterms:created dcterms:modified sgo:verified: xsd:boolean sgo:private: xsd:boolean sgo:outreach: xsd:decimal sgo:influence: xsd:decimal sgo:numPosts: xsd:nonNegativeInteger sgo:numFollowers: xsd:nonNegativeInteger * sgo:numFollowing: xsd:nonNegativeInteger sgo:numLikes: xsd:nonNegativeInteger sgo:declaredLocation: xsd:string sgo:shares * * sioc:UserAccount marl:describesObject Figure 5.9: Core ontology module of the SGO 91 dcterms:contributor sgo:copies sioc:follows sioc:subscriber_of * sioc:container_of sioc:reply_of sioc:has_reply * * * sioc:Community sioc:topic URI: concat(language, prefLabel) skos:prefLabel (language tagged) Property URI sioc:follows sioc:subscriber of sioc:topic dcterms:isPartOf dcterms:contributor foaf:nick foaf:accountName foaf:page foaf:avatar sioc:account of sioc:has function dcterms:created dcterms:modified sgo:verified sgo:private sgo:outreach sgo:influence sgo:numPosts sgo:numFollowers sgo:numFollowing sgo:numLikes sgo:withHeldIn sgo:declaredLocation Description An instance of sioc:UserAccount can be uniquely identified by a URI constructed with the URL of the website in which the user account is registered, together with the account name of the user in the site. User account followed by the user account being described, as for example a Facebook friend or a Twitter followee. Publication channel to which the user account is subscribed. Subject in which the owner of the user account is interested. Online community to which a user belongs. Other account that can contribute to the content published by the user account being described, or that can published in its name. Nick of the user in the publication channel (e.g. the screen name in the case of Twitter). Id of the user in the publication channel (in the case of Facebook and Twitter numeric identifiers are used). Web page that describes the user profile in the publication channel being defined. An image that represents the user in the publication channel. Person or organisation that owns the user account. Role that the user plays in the publication channel (e.g. influencer, owner, etc.). Date and time of creation of the user account. Date and time of modification of the user account. Determines if the publication channel has verified the person or organisation that has been declared as the owner of the user account. Determines whether the profile defined by the user account and the content produced can be only accessed by authorised users, or are publicly available in the Web. 
KPI that measures the overall outreach of the user account in terms of outreach metrics like the one provided by Kred (http: //kred.com). KPI that measures the overall influence of the user account in terms of influence metrics like the one provided by Klout (http: //klout.com) or Kred. KPI that measures the number of posts published by the user account. KPI that measures the number of followers of the user account. KPI that measures the number of user accounts followed by the user account being described. Number of likes that hasve been received by the user account. Country in which the user account has been banned due to legal restrictions, etc. Location declared by a user in her/his profile of a given social medium. Table 5.2: Properties of the class sioc:UserAccount 92 Property URL sioc:has container sioc:has creator dcterms:contributor dcterms:references sioc:related to sioc:reply of sioc:has reply sgo:shares sgo:copies marl:hasOpinion sioc:topic sioc:content sioc:links to dcterms:identifier dcterms:created dcterms:dateCopyrighted Description Posts can be uniquely identified by the URL of the web resources that annotate. Channel in which the post has been published. This property is the inverse of the propertysioc:container of, which has been defined by Table 5.5. User account that has published the post being described. This property is the inverse of the propertysioc:creator of, which has been defined by Table 5.2. User account that has contributed to the post being described. User account being mentioned in the post. Other post related to the content being described. Publication of which the post being described is a reply. This property is a sub property of the propertysioc:related to. Post that is a reply of the content being described. This property is also a sub property of sioc:related to, and the inverse of the property sioc:reply of. Post that is being spread by the post being described, for example by using a retweet when disseminating through Twitter. This property is also a sub property of sioc:related to, and the inverse of the property sioc:reply of. Other post whose content has been copied fully of partially in the post being described, without explicitly declaring it in content’s metadata (e.g. by setting the retweet flag when the publication channel is Twitter). This property is also a sub property of sioc:related to. Object property that relates the post with an opinion contained in it. Keyword included in the content of post, or subject that the post is about. Textual content of the post. The value of this property may be annotated with its language according to the mechanisms provided by RDF for tagging the language of string literals. Multimedia content (videos, photos, etc.) linked from the post. Identifier assigned by the publication channel to the post. Publication date of the post. Copyright date of the post. Table 5.3: Properties of the class sioc:Post (1/2) 93 Property dcterms:medium foaf:based near geo:lat geo:long schema:articleBody sioc:num views sioc:num replies sgo:numLikes sgo:numShares sgo:impact sgo:reach sgo:engagement sgo:relevance schema:wordCount sgo:isPromotion schema:isFamilyFriendly sgo:withHeldIn sgo:contentRating Description Main format of the content (text, video, etc.). Location from which the content has been published. Geographical latitude from which the content has been published. Geographical longitude from which the content has been published. Content of the post in HTML format. KPI that measures the number of views of the content. 
KPI that measures the number of replies to the content. KPI that measures number of times that the content has been liked. KPI that measures the number of times that the content has been shared. KPI that measures the degree in which the content has been viewed and shared. KPI constructed from the summatory of the influence of the author of the post and of the users that have disseminated the post. KPI that measures the engagement of the content. KPI that measures the relevance of the content. It is calculated as an aggregation of the KPIs of the post, the author and the site. Number of words included in the content. Indicates if the post contains and advertising message. Indicates if the post does not include sensible content (e.g. violence). Country in which the post has been banned due to legal restrictions, etc. Rating of the post according to its publication channel (e.g. Twitter rates the tweets according to its degree of dissemination). Table 5.4: Properties of the class sioc:Post (2/2) Property URI sioc:has subscriber sioc:container of dcterms:type sioc:has host Description An instance of the class sioc:Forum can be uniquely identified by a URI constructed with the URL of the website to which the publication channel belongs together with the type of publication channel. User account that is subscribed to the publication channel being described. This property is the inverse of the property sioc:subscriber of, which has been defined by Table 5.2. Post published within the publication channel being described. Type of publication channel (e.g. weblog, microblog, social network, etc.). Website to which the publication channel belongs. Table 5.5: Properties of the class sioc:Forum 94 Property URI marl:extractedFrom marl:describesObject marl:opinionText marl:polarityValue marl:hasPolarity onyx:hasEmotionCategory sgo:hasPurchaseStage sgo:hasMarketingMixAttribute Description Hashes constructed from the concatenation of the URL of the posts where the opinion has been described and the text of the opinion may uniquely identify instances of marl:opinion. Post from which the opinion has been extracted. This property is the inverse of the property marl:hasOpinion, which has even defined by Table 5.3 Entity that is the object of the opinion being described (e.g. a brand or product). Text of the opinion. Numeric value of the opinion polarity. The Marl ontology specification (http://www.gsi.dit.upm.es/ontologies/marl/) recommends using a real number in the interval [0, 1] for this value. Category of the opinion polarity (i.e. positive, negative or neutral) Kind of emotion expressed in the opinion. Purchase stage in the Consumer Decision Journey. Marketing Mix attribute. Table 5.6: Properties of the class marl:Opinion Property URI skos:prefLabel isocat:datcat Description An instance of the class skos:Concept can be uniquely identified by a URI that includes the language and the label of the topic or keyword. Label of the concept. The value of this property may be annotated with its language according to the mechanisms provided by RDF for tagging the language of string literals. Lexical or semantic classification of the concept expressed according to a ISOcat [Kemps-Snijders et al., 2008] category. The categories reused by this module are the following: verb, adjective, noun, common noun, proper noun, named entity, location, organisation, person, male, female, metadata tag, trademark (i.e. brand or product), and domain (i.e. business sector). 
Table 5.7: Properties of the class skos:Concept Property sioc:topic dcterms:isPartOf Description Subject (or topic) around which the community has been constructed. Broader community of which the community being described is part. Table 5.8: Properties of the class sioc:Community Property URI rdfs:label dcterms:description Description Used for uniquely identifying the graph Name assigned to the social graph instance. Text that describes the social graph instance. Table 5.9: Properties of the class rdfg:Graph 95 5.4 Publication Channels Module This module describes the classes and properties related with content publication channels, i.e. sites and sections within sites where social media contents are published. Figure 5.10 shows a UML representation of this module, which includes the class sioc:Site that describes websites. The properties defined for this class are shown in Table 5.10. cd Publication Channels sioc:Forum URI: concat(site, type) dcterms:type sioc:Site * sioc:has_host * sioc:host_of URL sgo:rank: decimal sgo:monthlyVisitors: nonNegativeInteger sgo:pagesPerVisit: nonNegativeInteger sgo:visitsPerVisitor: nonNegativeInteger sgo:minutesPerVisitor: nonNegativeInteger sgo:backlinks: nonNegativeInteger sgo:percentageMale: decimal sgo:percentageFemale: decimal Figure 5.10: Publication Channels module of the SGO Property URL sgo:rank sgo:monthlyVisitors sgo:visitsPerVisitor sgo:pagesPerVisit sgo:minutesPerVisitor sgo:backlinks sgo:percentageMale sgo:percentageFemale sioc:host of Description The URLs of the websites are used for identifying the instances of this class. KPI that ranks the site according to a relevance metric like Google’s PageRank [Page et al., 1999] or MozRank (http://moz. com/learn/seo/mozrank). KPI that measures the average number of unique visitors to the site per month. The correct identification of unique visitors may rely on techniques like the ones described in Section 2.2.2, or on the contribution of this thesis to the State of the Art described in Chapter 7. KPI that measures the average number of visits to the site per visitor and month. KPI that measures the average number of pages viewed by a visitor per visit. KPI that measures the average time in minutes spent by a visitor of the site per visit. KPI that measures the number of links to the site from other web pages. Percentage of male visitors. Percentage of female visitors. Publication channels that belong to the site being described. This property is the inverse of the property sioc:has host, which has been defined by Table 5.5. Table 5.10: Properties of the class sioc:Site 96 5.5 Contents Module This module describes the classes and properties related with the contents published in social media. Figure 5.11 shows a UML representation of this module. The classes defined by this module are the following: • The class foaf:Document represents any kind of multimedia document published online. The properties defined for this class are shown in Table 5.11. • The class schema:Review is used for creating posts annotations by social media analysts, community managers, or CRM operators. The properties defined for this class are shown in Table 5.12. The classes tzont:PoliticalRegion and tzont:Country are defined within in the module that deals with geographical locations (see Section 5.9). 
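As a brief illustration of this module, the sketch below (Python with rdflib; all values are invented and the sgo namespace is a placeholder) attaches a linked multimedia document and an analyst review to a post. The review is related to the reviewed post through the schema:review property shown in Figure 5.11.

# Sketch: a post linking a multimedia document and annotated with an analyst review.
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD

SIOC = Namespace("http://rdfs.org/sioc/ns#")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
SCHEMA = Namespace("http://schema.org/")
DCTERMS = Namespace("http://purl.org/dc/terms/")
SGO = Namespace("http://example.org/sgo#")  # placeholder namespace

g = Graph()

post = URIRef("http://twitter.com/jdoe/status/1")
video = URIRef("http://www.youtube.com/watch?v=abc123")   # foaf:Document identified by its URL
review = URIRef("http://example.org/reviews/1")            # invented identifier for the review

g.add((video, RDF.type, FOAF.Document))
g.add((post, SIOC.links_to, video))                        # multimedia content linked from the post

g.add((review, RDF.type, SCHEMA.Review))
g.add((post, SCHEMA.review, review))                       # cf. Figure 5.11
g.add((review, DCTERMS.created, Literal("2014-05-02", datatype=XSD.date)))
g.add((review, SCHEMA.reviewBody, Literal("Escalar al equipo de atención al cliente", lang="es")))
g.add((review, SGO.starred, Literal(True, datatype=XSD.boolean)))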
cd Contents sioc:Post sioc:links_to * URL sioc:content (language tagged) dcterms:identifier dcterms:created dcterms:title * dcterms:dateCopyrighted dcterms:medium sioc:num_views * sioc:num_replies geo:lat geo:long schema:wordCount schema:contentRating schema:articleBody schema:isFamilyFriendly * sgo:numLikes: xsd:nonNegativeInteger sgo:numShares: xsd:nonNegativeInteger sgo:impact: xsd:decimal sgo:reach: xsd:nonNegativeInteger sgo:engagement: xsd:decimal sgo:relevance: xsd:decimal sgo:isPromotion: xsd:boolean * foaf:Document URL foaf:based_near * sgo:withheldIn * tzont:PoliticalRegion tzont:Country URI: (identifier) schema:Review schema:review dcterms:created * dcterms:creator schema:reviewBody schema:keywords sgo:starred: xsd:boolean sgo:checked: xsd:boolean sgo:status: rdfs:Literal sgo:priority: rdfs:Literal Figure 5.11: Contents module of the SGO 97 Property URL Description Documents can be uniquely identified by their URLs of the resources that annotate. Table 5.11: Properties of the class foaf:Document Property dcterms:created dcterms:creator dcterms:reviewBody schema:keywords sgo:starred sgo:checked sgo:status sgo:priority Description Date of creation of the review. Name of the reviewer. Text of the review. Tags assigned by the reviewer to the post. Indicates if the post has been highlighted by the reviewer. Indicates if the review task has finished. Status of the actions derived from the review. Priority of the review. Table 5.12: Properties of the class schema:Review 5.6 Users Module This module describes the classes and properties related with social media users. Figure 5.12 shows a UML representation of this module. The classes defined by this module are the following: • The class sioc:Role represents roles that the user accounts play in social media, like influencer, content propagator, etc. The property defined for this class is shown in Table 5.13. • The class foaf:Agent defines persons or organisations that own user accounts. The properties defined for this class are shown in Table 5.14. • The class foaf:Organisation is used for describing organisations. We do not have defined additional properties for the class foaf:Organisation. This class is a subclass of foaf:Agent. • The class foaf:Person is used for defining persons. This class is a subclass of foaf:Agent. The properties defined for this class are shown in Table 5.15. • The class foaf:Image is a subclass of the class foaf:Document, which has been described in Section 5.5. This class is used for defining images assigned to user accounts. 98 • The class foaf:PersonalProfileDocument is a also a subclass of the class foaf:Document. This class is used for defining web pages that describe user accounts. • The class sgo:Activity is used for registering an activity record captured by a tracking server. This activity record can be associated to a cookie, a fingerprint, or both. The properties defined for this class are shown in Table 5.16. • The class sgo:Cookie is used for describing cookies installed in web browsers used by users. A cookie is used in the context of this thesis as a mechanism for uniquely identifying browsers, as it has been described in Section 2.2.2.1. The properties defined for this class are shown in Table 5.17. • The class sgo:Fingerprint is used for describing device fingerprints. The properties defined for this class are shown in Table 5.18. The classes tzont:PoliticalRegion and tzont:Country are defined within in the module that deals with geographical locations (see Section 5.9). 
99 cd Users * sioc:function_of * sioc:UserAccount URI: concat(site, accountName) foaf:nick foaf:accountName dcterms:created dcterms:modified * sgo:verified: xsd:boolean * sgo:private: xsd:boolean sgo:outreach: xsd:decimal sgo:influence: xsd:decimal sgo:numPosts: xsd:nonNegativeInteger sgo:numFollowers: xsd:nonNegativeInteger sgo:numFollowing: xsd:nonNegativeInteger sgo:numLikes: xsd:nonNegativeInteger sgo:declaredLocation: xsd:string * * sioc:avatar foaf:Image tzont:Country sgo:withheldIn * URI: (identifier) * sgo:Fingerprint URI: hash(all attributes) dcterms:created sgo:xRealIP: xsd:NMTOKEN sgo:xForwardedFor: xsd:NMTOKEN sgo:userAgent: xsd:string sgo:accept: xsd:string sgo:acceptLanguage: xsd:string sgo:acceptCharset: xsd:string sgo:acceptEncoging: xsd:string sgo:cacheControl: xsd:string sgo:plugins: xsd:string sgo:fonts: xsd:string sgo:video: xsd:string sgo:timeZone: xsd:string sgo:sessionStorage: xsd:boolean sgo:localStorage: xsd:boolean sgo:iePersistence: xsd:boolean foaf:page * * sioc:Role sioc:has_function foaf:PersonalProfileDocument foaf:Document URL * foaf:based_near sioc:account_of foaf:page * foaf:Agent foaf:account * sgo:hasFingerprint * sgo:hasActivity foaf:name foaf:age foaf:mbox dcterms:language dcterms:description 1 * tzont:PoliticalRegion sgo:Activity * dcterms:created sgo:hasCookie * 0..1 sgo:Cookie * foaf:Organisation 0..1 foaf:Person foaf:givenName foaf:familyName schema:jobTitle foaf:gender URI: hash(label,value,domain,path) rdfs:label dcterms:created dcterms:valid sgo:value: xsd:string sgo:domain: xsd:NMTOKEN sgo:path: xsd:normalizedString sgo:isSecure: xsd:boolean sgo:httpOnly: xsd:boolean Figure 5.12: Users module of the SGO 100 Property foaf:function of Description User account that plays the role being described. This property is the inverse of the property sioc:has function, which has been defined by Table 5.2. Table 5.13: Property of the class sioc:Role Property foaf:name foaf:age foaf:mbox foaf:page foaf:account foaf:based near dcterms:language dcterms:description sgo:hasActivity Description Name of the agent. Age of the agent. E-mail of the agent. Web page owned by the agent (e.g. weblog, homepage, etc.). A user account owned by the agent. This property is the inverse of sioc:account of, which has been defined by Table 5.13. Normalised geographical location of the agent (e.g. place of residence). In Section 8.5 we provide a technique for identifying the place of residence of social media users. Language spoken by the agent. Description declared by the user about herself/himself in her/his profile of the social medium. Activity record registered for the agent. Table 5.14: Properties of the class foaf:Agent Property foaf:givenName foaf:familyName schema:jobTitle foaf:gender Description Given name (e.g. first name) of the person being described. Family name (e.g. last name) of the person being described. Profession of the person being described. Gender of the person ( “male” or “female”). In Section 8.6 we provide a technique for identifying the gender of social media users. Table 5.15: Properties of the class foaf:Person Property dcterms:created sgo:hasCookie sgo:hasFingerprint Description Timestamp in which the activity record has been gathered. It is defined at the granularity of milliseconds. Cookie assigned to a web browser when registering the activity. Fingerprint of a device when registering the activity. 
Table 5.16: Properties of the class foaf:Activity 101 Property URI rdfs:label dcterms:created dcterms:valid sgo:value sgo:domain sgo:path sgo:isSecure sgo:httpOnly Description An instance of the class sgo:Cookie can be uniquely identified by a URI constructed with a hash created from the name, the value, the domain and the path of the cookie. Name of the cookie. Date and time of creation of the cookie. Expiry date and time of the cookie. Value assigned to the cookie. Domain scope of the cookie. Path scope of the cookie. Determines whether the cookie can only be sent using secure connections. Determines whether the cookie can only be sent through HTTP [Fielding and Reschke, 2014a]. Table 5.17: Properties of the class sgo:Cookie Property URI dcterms:created sgo:xRealIP sgo:xForwardedFor sgo:userAgent sgo:accept sgo:acceptLanguage sgo:acceptCharset sgo:acceptEncoding sgo:cacheControl sgo:plugins sgo:fonts sgo:video sgo:timeZone sgo:sessionStorage sgo:localStorage sgo:iePersistence Description An instance of the class sgo:Fingerprint can be uniquely identified by a URI constructed with a hash created from all fingerprint attributes (i.e. the ones described in Section 2.2.2.2) Date and time of creation of the fingerprint. IP address [Postel, 1981] of the user’s device. IP address of the user’s device followed by the IP addresses of the proxy servers between the device and the web server that has registered the fingerprint. Information about the device (browser, operating system, etc.) used by the user. Kind of content requested by the device to the web server when such server registered the fingerprint. Language expected by the device. Charset expected by the device. Encoding or compression format expected by the device. Directive that specifies the caching mechanisms to be applied along the request-response chain. Plugins installed in the web browser used by the device. Fonts installed in the device. Video settings of the device. Time zone of the device’s user. Indicates if the device supports data persistence which is available during a navigation session. Indicates if the device supports persistent data which is available beyond a navigation session. Indicates whether the device supports data persistence when the browser is Internet Explorer. Table 5.18: Properties of the class sgo:Fingerprint 102 5.7 Opinions Module This module describes the classes and properties related with the opinions expressed by consumers in their posts. Figure 5.13 shows a UML representation of this module. The classes defined by this module are the following: • The class marl:Polarity indicates the polarity of the opinion. There are three possible instances of this class: marl:Positive, marl:Negative and marl:Neutral. • The class onyx:EmotionCategory is used for indicating the kind of emotion expressed within an opinion according to the categories defined in the State of the Art (see Table 2.4). Therefore, we have defined the following instances for this class: sgo:Satisfaction, sgo:Dissatisfaction, sgo:Love, sgo:Hate, sgo:Happiness, sgo:Sadness, sgo:Trust and sgo:Fear. This thesis provides a technique for identifying these emotion categories in Section 8.4. • The class sgo:PurchaseStage is used for indicating the purchase stage expressed by a consumer according to the categories defined in the State of the Art (see Figure 2.6). Therefore, we have defined the following instances for this class: sgo:Awareness, sgo:Evaluation, sgo:Purchase and sgo:PostpurchaseExperience. 
This thesis provides a technique for identifying these Consumer Decision Journey stages in Section 8.2. • The class sgo:MarketingMixAttribute is used for indicating the Marketing Mix attributes to which consumers refer within their opinions according to the categories defined in the State of the Art (see Table 2.3). Therefore, we have defined the following instances for this class: sgo:CustomerService, sgo:Sponsorship, sgo:Quality, sgo:Promotion, sgo:Advertisement, sgo:Price, sgo:Design, sgo:PointOfSale, sgo:Warranty and sgo:LoyaltyMarketing. This thesis provides a technique for identifying these purchase stages in Section 8.3, with the exception of Warranty and Loyalty Marketing, which are out of the scope. 103 cd Opinions sgo:Satisfaction sgo:Dissatisfaction sgo:Love sgo:Hate sgo:Happiness sgo:Sadness sgo:Trust sgo:Fear onyx:EmotionCategory * onyx:hasEmotionCategory * sgo:hasPurchaseStage sgo:PurchaseStage marl:hasPolarity marl:Opinion * URI: hash(post, text) marl:optinionText marl:polarityValue 0..1 sgo:Awareness marl:Polarity * marl:Positive * sgo:Evaluation * marl:Neutral sgo:PostpurchaseExperience marl:Negative sgo:hasMarketingMixAttribute sgo:Purchase * sgo:MarketingMixAttribute sgo:Design sgo:Quality sgo:Sponsorship sgo:CustomerService sgo:Price sgo:Promotion sgo:Advertisement sgo:PointOfSale sgo:Warranty sgo:LoyaltyMarketing Figure 5.13: Opinions module of the SGO 104 5.8 Topics and Keywords Module This module describes the instances used for annotating the topics and keywords included in social media content. Figure 5.14 shows a UML representation of this module. Note that there exists part-of relationships between some of the categories show in the figure (e.g. between noun types). The definition of this mereology is out of the scope of this work since they are specified by ISOCat [Kemps-Snijders et al., 2008]. cd Topics and Keywords skos:Concept URI: concat(language, prefLabel) skos:prefLabel (language tagged) * * isocat:datcat rdfs:Resource isocat:DC-1424 isocat:DC-1230 (verb) (adjective) isocat:DC-1333 isocat:DC-1256 (noun) (common noun) isocat:DC-1371 isocat:DC-4339 (proper noun) (location) isocat:DC-2275 isocat:DC-2979 (named entity) (Organisation) isocat:DC-2978 isocat:DC-2950 (Person) (female) isocat:DC-2949 isocat:DC-414 (male) (trademark) isocat:DC-5436 isocat:DC-2212 (metadata tag) (domain) Figure 5.14: Topics and Keywords module of the SGO 105 5.9 Geographical Locations Module This module describes the classes and properties related with the locations of users and contents. Figure 5.15 shows a UML representation of this module. The classes defined by this module are the following: • The class tzont:PoliticalRegion represents a location that corresponds to any kind of political region (e.g. country, state, city). The properties defined for this class are shown in Table 5.19. • The class tzont:Country represents a political region that corresponds to a country. The properties defined for this class are shown in Table 5.20. • The class tzont:State represents a political region that corresponds to an administrative region of first level within a country (e.g. state, autonomous community). The properties defined for this class are shown in Table 5.21. • The class tzont:County represents a political region that corresponds to an administrative region of second level within a country (e.g. county, province). The properties defined for this class are shown in Table 5.22. 
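A single opinion annotated with the categories of this module could be represented as in the following sketch (Python with rdflib; the opinion text is invented, SHA-1 is used as an example hash function, and the sgo namespace is a placeholder). The opinion URI is built as a hash of the post URL and the opinion text, as stated in Table 5.6.

# Sketch: an invented opinion with polarity, emotion, purchase stage and Marketing Mix attribute.
import hashlib
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD

MARL = Namespace("http://purl.org/marl/ns#")
ONYX = Namespace("http://www.gsi.dit.upm.es/ontologies/onyx/ns#")
SGO = Namespace("http://example.org/sgo#")  # placeholder namespace

post = URIRef("http://twitter.com/jdoe/status/1")
text = "el precio del nuevo modelo me parece excesivo"

# marl:Opinion identified by hash(post, text).
opinion = URIRef(str(SGO) + hashlib.sha1((str(post) + text).encode("utf-8")).hexdigest())

g = Graph()
g.add((opinion, RDF.type, MARL.Opinion))
g.add((opinion, MARL.extractedFrom, post))
g.add((opinion, MARL.opinionText, Literal(text, lang="es")))
g.add((opinion, MARL.polarityValue, Literal(0.2, datatype=XSD.decimal)))
g.add((opinion, MARL.hasPolarity, MARL.Negative))
g.add((opinion, ONYX.hasEmotionCategory, SGO.Dissatisfaction))
g.add((opinion, SGO.hasPurchaseStage, SGO.Evaluation))
g.add((opinion, SGO.hasMarketingMixAttribute, SGO.Price))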
• The class tzont:City represents a political region that corresponds to an administrative region of third level within a country (e.g. city, town, village, settlement). The properties defined for this class are shown in Table 5.23. • The class schema:Continent represents a continent of the world. The properties defined for this class are shown in Table 5.24. • The class tzont:TimeZone represents a time zone to which a political region belongs. The properties defined for this class are shown in Table 5.25. 106 cd Locations tzont:hasParentRegion * tzont:TimeZone URI: (label) rdfs:label tzont:GMToffset * tzont:PoliticalRegion tzont:hasTimeZone dcterms:identifier rdfs:label * * geo:lat geo:long * tzont:hasParentRegion tzont:Country schema:Continent URI: (identifier) * URI: (identifier) dcterms:identifier rdfs:label tzont:State URI: concat(country, identifier) tzont:County URI: concat(country, state, identifier) tzont:City URI: concat(country, state, city, identifier) Figure 5.15: Locations module of the SGO Property dcterms:identifier rdfs:label geo:lat geo:long tzont:hasParentRegion tzont:hasTimeZone Description Identifier of the political region. Name of the political region. Representative latitude of the political region. Representative longitude of the political region. Region (political or continent) to which a political region belongs. This property is used for modelling the part-of relationship among geographical political entities (City, County, State and Country). Time zone to which a political region belongs. Table 5.19: Properties of the class tzont:PoliticalRegion 107 Property URI Description The instances of the class tzont:Country can be uniquely identified by a URI constructed from the identifier of the country. Table 5.20: Properties of the class tzont:Country Property URI Description The instances of the class tzont:State can be uniquely identified by a URI constructed from the identifiers of the country and the state. Table 5.21: Properties of the class tzont:State Property URI Description The instances of the class tzont:County can be uniquely identified by a URI constructed from the identifiers of the country, the state, and the county. Table 5.22: Properties of the class tzont:County Property URI Description The instances of the class tzont:City can be uniquely identified by a URI constructed from the identifiers of the country, the state, the county and the city. Table 5.23: Properties of the class tzont:City Property URI dcterms:identifier rdfs:label Description The instances of the class schema:Continent can be uniquely identified by a URI constructed from the identifier of the continent. Identifier of the continent. Name of the continent. Table 5.24: Properties of the class schema:Continent Property URI rdfs:label tzont:GMToffset Description The instances of the class tzont:TimeZone can be uniquely identified by a URI constructed from the name of the time zone. Name of the time zone. Difference of the time zone from Greenwich Meridian Time (GMT). Table 5.25: Properties of the class tzont:TimeZone 108 Chapter 6 MORPHOSYNTACTIC CHARACTERISATION OF SOCIAL MEDIA CONTENTS In this chapter, we make use of a part-of-speech tagger to process and characterise a corpus of user-generated content extracted from different social media sources. Specifically, we have studied differences in the language used in distinct types of social media content by analysing the distribution of part-of-speech categories in such sources. 
The chapter is structured as follows: • Firstly, Section 6.1 describes the kinds of social media that we have compared, from which we have extracted the contents to be analysed. • Secondly, Section 6.2 explains the distributions of part-of-speech categories by type of social media. • Finally, Section 6.3 presents the conclusions of the analysis, validating the first hypothesis of this thesis: the contents published in social media statistically present different morphosyntactic features depending on the specific kind of media where they have been published. 109 6.1 Types of Social Media Analysed We have characterised the following types of social media by extracting and analysing a random sample of 10,000 textual contents published on them, uniformly distributed among the following media types: Blogs. We have extracted the texts of posts published in feeds of blog publishing platforms such as Wordpress53 and Blogger54 . Content published in these sites usually consists on medium-sized posts and small comments about such posts. Forums. We have scrapped the text of the comments published in web forums constructed with vBulletin55 and phpBB56 technologies. Content published in these sites consists in dialogues between users in the form of a timely ordered sequence of small comments. Microblogs. We have extracted the short messages published in Twitter and Tumblr57 by querying their APIs. Content published in these sources consists on small pieces of text (e.g. maximum 140 characters for Twitter). Social networks. We have extracted the messages published in Facebook and Google Plus58 by querying their APIs. Content published in these sites goes from small statuses or comments to medium-sized posts. Review sites. We have scrapped the text of the comments published in Ciao59 , Dooyoo60 and reviews published in Amazon61 . The length of the content published in these sites is also variable. 53 http://wordpress.org http://www.blogger.com 55 http://www.vbulletin.com 56 http://www.phpbb.com 57 http://www.tumblr.com 58 http://plus.google.com 59 http://www.ciao.com 60 http://www.dooyoo.com 61 http://www.amazon.com 54 110 Audio-visual content publishing sites. We have extracted the textual comments associated to the audio-visual content published in YouTube62 and Vimeo63 . Textual content published in these sites takes the form of small textual comments. News publishing sites. We have extracted the articles from the feeds published in such sources. Sites of this kind can be classified as traditional editorially controlled media. However, comments posted by article readers can be catalogued as user-generated content. Thus, content published in news sites consists on articles and small comments about such articles. Other sites not classified in the categories above (e.g. Content Management Systems) that publish their content as structured feeds, or that have a known HTML structure from which a scrapping technique can be applied. The content published in these sites is heterogeneous. 6.2 Distribution of Part-of-Speech Categories For performing the study of the distribution of part-of-speech (PoS) categories in user-generated content, we have collected a corpus with 10, 000 posts written in Spanish, obtained from the sources described in the previous section. The posts extracted are related to the telecommunications domain. We have performed the PoS analysis by implementing a GATE [Cunningham et al., 2011] pipeline, with TreeTagger [Schmid, 1994] as the PoS tagger. 
Therefore, the PoS distributions obtained are based on an automatic tagger. A previous work by Garc´ıa Moya [2008] includes an evaluation of TreeTagger with a Spanish parameterisation when applied to a corpus of news articles. The precision, recall and F-measure obtained on such evaluation were 0.8831, 0.8733 and 0.8782, respectively. Table 6.1 shows the distributions obtained. The TreeTagger tag-set for Spanish determines the PoS categories. As shown in the table, there are variations in the distribution of these categories with respect to the publication source. 64 62 http://www.youtube.com http://vimeo.com 64 ftp://ftp.ims.uni-stuttgart.de/pub/corpora/spanish-tagset.txt 63 111 Table 6.1: Distribution of part-of-speech categories by social media type PoS Category Noun Common Proper Foreign word Measure unit (e.g. GHz) Month name (e.g. Feb) Acronym (e.g. UN) Letter of the alphabet (e.g. b) Alphanumeric code (e.g. A4) Symbol (e.g. $, £) Adjective Quantity ordinal Quantity cardinal Quantity other Other Adverb Negation Other Determiner Conjunction Adversative coordinating Negative coordinating Other coordinating ”que” Subordinating (finite clauses) Subordinating (infinite clauses) Other subordinating Pronoun Demonstrative Interrogative Personal (clitic) Personal (non-clitic) Posessive Relative Preposition Portmanteau word “al” Portmanteau word “del” Other Punctuation mark Full stop Comma Colon Semicolon Dash Ellipsis Slash Percent sign Left parenthesis Rigth parenthesis Quotation symbol Verb To be (“estar”) To have (“haber”) Lexical past participle Lexical finite Lexical gerund Lexical infinitive Modal To be (“ser”) past part. To be (“ser”) infinitive To be (“ser”) other “Se” (as particle) News 30.9% 53.3% 42.3% 0.2% 0.2% Blogs 30.0% 56.9% 37.3% 0.5% 0.8% Audiov. 29.0% 50.5% 42.9% 1.4% 0.0% Reviews 23.2% 71.5% 23.8% 0.5% 0.6% Microbl. 33.7% 50.4% 36.1% 1.8% 0.1% Forums 22.0% 68.8% 25.7% 0.9% 0.2% Other 26.6% 60.9% 34.1% 0.7% 0.2% S. Net. 
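The distributions in Table 6.1 were obtained with the GATE pipeline mentioned above; purely as an illustration, the following minimal sketch (Python with the treetaggerwrapper package, which assumes a local TreeTagger installation with the Spanish parameter files; the corpus variable is a hypothetical mapping from media type to its list of extracted texts) computes an equivalent per-source distribution of PoS tags.

# Sketch: relative frequency of each PoS tag per social media type (illustrative only).
from collections import Counter
import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG="es")  # requires TreeTagger and its Spanish parameters

def pos_distribution(texts):
    counts = Counter()
    for text in texts:
        for tag in treetaggerwrapper.make_tags(tagger.tag_text(text)):
            if hasattr(tag, "pos"):          # skip tokens that could not be tagged
                counts[tag.pos] += 1
    total = sum(counts.values())
    return {pos: count / total for pos, count in counts.items()}

# corpus: hypothetical dict such as {"microblogs": [...], "forums": [...], ...}
# distributions = {media_type: pos_distribution(texts) for media_type, texts in corpus.items()}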
32.7% 50.2% 43.1% 1.0% 0.2% All 27.4% 59.2% 34.6% 0.8% 0.3% 0.5% 1.1% 0.4% 0.1% 0.1% 0.3% 0.5% 0.4% 0.4% 0.3% 0.6% 0.5% 1.1% 0.5% 2.3% 0.1% 1.0% 0.3% 4.0% 0.5% 1.7% 0.3% 1.0% 0.5% 1.9% 0.3% 1.5% 2.2% 1.5% 1.9% 0.9% 1.1% 1.2% 1.9% 1.1% 1.5% 0.4% 8.6% 4.6% 34.7% 7.5% 53.3% 2.5% 18.2% 81.8% 11.5% 6.1% 2.4% 0.3% 8.3% 2.7% 30.6% 12.0% 54.8% 3.4% 18.1% 81.9% 9.8% 7.8% 3.1% 0.1% 6.4% 1.4% 28.5% 14.5% 55.6% 3.2% 29.7% 70.3% 7.6% 6.6% 3.9% 1.4% 8.2% 1.5% 22.0% 23.6% 53.0% 4.9% 23.9% 76.1% 8.0% 9.7% 5.7% 6.1% 9.4% 0.4% 33.0% 7.4% 59.1% 3.9% 36.2% 63.8% 5.8% 6.2% 7.0% 0.7% 7.1% 1.1% 24.8% 23.3% 50.8% 4.5% 30.0% 70.0% 8.0% 10.1% 5.7% 0.5% 8.4% 1.7% 34.3% 13.8% 50.1% 3.7% 30.6% 69.4% 8.7% 8.7% 4.1% 1.5% 6.2% 1.1% 25.5% 19.3% 54.1% 3.4% 29.1% 70.9% 7.5% 7.4% 3.7% 1.3% 8.0% 1.9% 29.6% 15.7% 52.9% 3.8% 27.4% 72.6% 8.5% 8.3% 4.6% 0.3% 0.9% 0.7% 1.5% 1.0% 1.5% 1.3% 1.3% 1.2% 44.3% 28.5% 2.2% 44.2% 26.9% 3.1% 36.6% 27.0% 1.6% 29.3% 34.4% 4.4% 36.6% 26.1% 1.4% 32.5% 31.7% 3.0% 38.9% 29.5% 2.9% 41.6% 26.7% 2.2% 36.7% 30.1% 3.0% 10.6% 9.7% 18.7% 10.8% 10.7% 11.1% 10.2% 12.0% 10.8% 11.7% 1.9% 23.7% 0.7% 17.1% 15.7% 38.4% 4.3% 15.2% 3.8% 12.0% 3.4% 24.3% 0.9% 16.0% 22.1% 34.3% 2.4% 14.6% 3.1% 11.5% 5.0% 15.4% 0.0% 11.4% 37.3% 33.0% 2.8% 11.8% 3.4% 13.9% 5.6% 20.2% 0.8% 11.4% 44.3% 21.2% 2.1% 12.7% 2.8% 17.2% 4.7% 15.1% 1.8% 16.3% 42.9% 22.0% 1.9% 8.2% 2.1% 14.6% 5.8% 13.9% 1.1% 17.2% 50.3% 15.9% 1.6% 11.9% 2.8% 13.1% 4.3% 18.3% 0.6% 14.6% 39.0% 24.8% 2.7% 12.9% 3.1% 12.6% 4.4% 16.2% 0.8% 12.8% 42.5% 24.6% 3.1% 11.5% 3.0% 13.5% 4.4% 17.8% 0.9% 14.6% 40.8% 23.4% 2.4% 12.6% 3.1% 7.6% 4.2% 3.9% 4.5% 3.2% 3.9% 4.3% 4.8% 4.8% 88.6% 10.7% 92.7% 8.5% 92.8% 12.9% 92.6% 9.4% 94.7% 8.3% 93.3% 9.2% 92.6% 9.7% 92.3% 10.5% 92.1% 9.7% 4.9% 48.9% 3.8% 1.0% 2.5% 2.9% 0.5% 1.3% 13.4% 13.4% 7.5% 12.0% 1.6% 5.8% 16.0% 17.1% 54.5% 3.8% 0.9% 1.4% 4.3% 0.0% 1.1% 6.2% 6.2% 4.5% 13.8% 1.9% 3.5% 13.4% 41.5% 29.1% 2.4% 1.3% 3.4% 7.7% 0.0% 0.0% 5.2% 4.7% 4.6% 16.8% 0.5% 2.4% 11.7% 8.7% 50.1% 5.4% 0.5% 1.5% 8.8% 0.6% 0.9% 8.8% 8.6% 6.2% 17.8% 1.7% 3.5% 10.2% 29.8% 25.2% 13.9% 0.6% 0.7% 16.3% 3.8% 0.0% 2.1% 4.1% 3.5% 19.1% 1.1% 2.0% 5.8% 25.5% 44.1% 4.8% 0.5% 2.1% 8.4% 0.1% 0.7% 5.1% 5.5% 3.2% 20.5% 1.5% 3.2% 10.0% 13.2% 44.7% 5.2% 0.5% 3.6% 6.2% 0.3% 1.3% 11.1% 11.0% 2.9% 16.4% 1.5% 3.9% 12.2% 25.0% 33.8% 15.2% 0.6% 3.3% 9.2% 0.1% 0.4% 4.1% 4.6% 3.6% 16.0% 1.3% 1.9% 8.9% 16.8% 43.7% 6.6% 0.7% 2.4% 7.4% 0.5% 0.9% 8.1% 8.3% 4.5% 16.7% 1.5% 3.4% 10.8% 47.2% 1.0% 20.4% 1.5% 0.6% 48.8% 0.7% 22.9% 1.8% 0.3% 48.5% 0.3% 28.1% 0.8% 0.6% 46.8% 0.9% 25.5% 1.4% 0.9% 50.1% 0.4% 32.0% 0.8% 0.1% 50.2% 0.8% 26.7% 1.9% 0.2% 48.5% 0.8% 25.0% 1.6% 0.4% 51.8% 1.1% 26.9% 1.9% 0.1% 48.8% 0.8% 26.0% 1.6% 0.4% 0.4% 0.4% 0.4% 0.6% 0.3% 0.3% 0.5% 0.5% 0.4% 5.6% 0.7% 6.4% 0.6% 6.8% 0.7% 8.7% 0.5% 7.3% 0.7% 5.3% 0.7% 5.7% 0.6% 5.6% 0.6% 6.3% 0.6% 112 The distribution of all PoS categories in news publishing sites and blogs is very similar, because the posts published in these sources have a similar writing style, as there are no limitations on the size of such posts. In addition, the sources not classified (i.e. “other”) have a similar distribution to the combination of all sources. This may be due to the heterogeneity of the publications contained in the web pages that have not been classified as specific content type. Next, we discuss some relevant insights obtained from the distribution of each PoS category. 6.2.1 Distribution of Nouns As shown in Table 6.1 the distribution of common and proper nouns is very different for forums and reviews. 
It seemed strange to us that proper nouns, found in the sources where discussions about specific product models are raised, were less used than in the other sources. After examining a sample of 100 texts, we noticed that in those sources, product names are often written in lower case, which lead to an incorrect PoS annotation. After reprocessing the corpus using gazetteers, including proper names in lower case, we found that this is a problem with TreeTagger precision. Such problem makes entity recognition less accurate, when such entity recognition requires a previous step of detecting proper nouns using PoS tagging. Although the use of gazetteers improves entity detection, this solution is domain-dependent. In addition, foreign words are less used in news than in other sources, because the style rules of traditional media require avoiding such foreign words, as far as possible, whenever a Spanish word exists. Finally, the relative big distribution of letters of the alphabet category is due to a TreeTagger accuracy error (overall when analysing short texts published in Twitter). 6.2.2 Distribution of Adjectives As shown in Table 6.1, the distribution of adjectives of quantity is near 50% for most of the sources (adding quantity ordinal, quantity cardinal, and other). The 113 adjectives of quantity commonly used are the cardinals and the less used are the ordinals, whose use is insignificant in all sources, except in news publishing sites. The rest of quantifying adjectives (quantity others) are used quite frequently in forums and reviews, because such sites include publications of quantitative evaluations and comparisons of products. Specifically, in these sites, we find multiplicative (e.g. doble, triple), partitive (e.g. medio, tercio), and indefinite quantity adjectives (e.g. mucho, poco, bastante). 6.2.3 Distribution of Adverbs The adverbs of negation (e.g. jam´ as, nada, no, nunca, tampoco) are used with more frequency in the sources with limitations of posts length. Moreover, there is an inverse correlation between the size of the texts and the use of adverbs of negation. The detection of such negations is essential when performing sentiment analysis, since they reverse the sentiment of the opinion about specific entities. 6.2.4 Distribution of Determiners Determiners are used to a lesser extent in microblogs than in the other media types (overall un news and blogs), because the limitation of post length (e.g. 140 characters in Twitter) requires that posts are written more concisely, and therefore meaningless grammatical categories tend to be used less. 6.2.5 Distribution of Conjunctions With respect to conjunctions, the distribution of coordinating conjunctions is higher in sources where the texts are longer (i.e. news and blogs), and lower in sources were posts are shorter, especially in forums and reviews because these sources have a question-answer structure dominated by short sentences. Coordinating conjunctions are useful for opinion mining to identify opinion chunks, as well as punctuation marks. 114 6.2.6 Distribution of Pronouns The distribution of personal pronouns (e.g. yo, t´ u, m´ı) is higher in microblogs, reviews, forums and audio-visual content publishing sites because, in these sources, conversations between the users that generate the content are predominant, in contrast to the narrative style of news and blogs articles. 
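The gazetteer-based reprocessing mentioned in Section 6.2.1 can be sketched as the following post-correction step (Python; the gazetteer entries, the sample tags and the proper-noun label are illustrative assumptions rather than the exact resources used in this thesis).

# Sketch: re-tagging lower-cased brand and product names as proper nouns after PoS tagging.
GAZETTEER = {"iphone", "galaxy", "vodafone", "movistar"}   # domain-dependent, lower-cased entries

def correct_proper_nouns(tagged_tokens, proper_noun_tag="NP"):
    """tagged_tokens: list of (word, pos) pairs produced by the tagger."""
    corrected = []
    for word, pos in tagged_tokens:
        if word.lower() in GAZETTEER:
            pos = proper_noun_tag            # force a proper-noun annotation
        corrected.append((word, pos))
    return corrected

# Example with illustrative tags: the product name written in lower case is corrected.
print(correct_proper_nouns([("mi", "DET"), ("iphone", "NC"), ("es", "VSfin"), ("genial", "ADJ")]))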
Generally, pronouns make it difficult to identify entities within opinions, because such entities are not explicitly mentioned when using pronouns. 6.2.7 Distribution of Prepositions As happened with determiners, prepositions are used to a lesser extent in microblogs than in the other media types, because of the use of a concise language. 6.2.8 Distribution of Punctuation Marks Full stops are less used in news than in other sources, because longer sentences are published in news articles which require other kinds of punctuation marks (e.g. comma), in comparison to the rest of social media sources, where concise phrases finished are usually written, which implies a bigger density of full stops. The use of comma is lower in sources where there is less writing, that is, on Twitter and sites with comments on audio-visual content. The heavy use of the colon and slash in microblogs is due to the inclusion of these characters in the emoticons and the sources cited through links embedded in tweets. Ellipses are more used in microblogs than in the rest of the sources, because of the limitation of the size of the messages. In this source, unfinished messages are posted frequently, so ellipses are added to express that such messages are incomplete. Furthermore, some Twitter clients truncate messages longer than 140 characters, and automatically add the ellipsis. Finally, parenthesis and other non-commonly used punctuation marks (e.g. percent sign) are less used in microblogs, because of the limited length of the tweets and the difficulty for introducing these characters on mobile terminals. 115 6.2.9 Distribution of Verbs With respect to verbs, in forums and microblogs its use is more extensive, in proportion to the rest of the PoS categories, than in the other social media sources. A reason for this may be that intentions and actions are expressed more often in these sources. In addition, there is less use of the past participle within microblogs than in other sources. This is because microblogs are used to transmit immediate experiences, so most of the posts are communicated in the present tense. Similarly, the infinitive is more used in microblogs for lexical verbs. Finally, lexical finite verbs are used similarly in all the social media channels. 6.3 Hypothesis Validation We have demonstrated that the distribution of PoS categories varies across different social media types, which validates Hypothesis 1. Since PoS tagging is a previous step for many NLP techniques, the performance of such techniques may vary according to the social media source from which the user-generated content has been extracted. As an example, a disambiguation strategy for topic identification may use nouns as context for performing disambiguation. Thus, sources with a higher distribution of nouns will provide more context than sources in which such distribution is smaller. The proportion of other categories may have impact over the performance of other techniques (e.g. adjectives and adverbs over sentiment analysis). 116 Chapter 7 TECHNIQUE FOR UNIQUE USER IDENTIFICATION BASED ON EVOLVING DEVICE FINGERPRINT DETECTION As we have explained in Section 2.2.2.2, any technique for identifying users based on browser fingerprint must be accompanied with an algorithm to detect different fingerprints corresponding to a single browser, because browser fingerprint changes very often [Eckersley, 2010]. 
This chapter describes a novel technique that takes into account the temporal evolution of fingerprints, as well as the entropy of fingerprint attributes for weighting the importance of each fingerprint attribute according to its discriminative power. This technique consists in the instantiation of a set of activities defined by the CRISP-DM methodology [Shearer, 2000]. Such activities are the following: 1. The Data Understanding activity collects the fingerprint data and analyses them from different perspectives, ensuring that they are valid for model learning purposes. This activity is explained in Section 7.1. 117 2. The Data Preparation activity covers all the tasks required to construct the dataset used for learning and evaluating the technique, including ensuring that users are uniquely identified and removing non-human activity from it. This activity is explained in Section 7.2. 3. The Modelling activity consists in selecting the modelling technique and in learning the specific models that will be used for identifying unique users. This activity is explained in Section 7.3. 4. The Evaluation activity consists in evaluating the models obtained. This activity is explained in Section 7.4. Next, each of the activities are described. After that, in Section 7.5 we validate the hypothesis formulated in Section 3.4 regarding unique user identification through device’s fingerprint. 7.1 Data Understanding Activity This activity consists in the ordered execution of the following tasks: 1. The Collect Initial Data task consists in obtaining the activity produced in websites. This task is described in Section 7.1.1. 2. The Describe Data task consists in performing a description of the format and volume of the data gathered. This task is described in Section 7.1.2. 3. The Explore Data task consists in performing a deeper statistical analysis of data from several viewpoints to ensure that the data are valid for modelling purposes. This task is described in Section 7.1.3. 4. The Verify Data Quality task consists in examining the quality of the data by attending to the analyses performed in the previous tasks. This task is described in Section 7.1.4. 118 7.1.1 Collect Initial Data Task This task consists in collecting the activity produced by users in websites as well as in collecting their fingerprints. Such fingerprints are made of a set of values for several HTTP headers [Fielding and Reschke, 2014b] and other attributes accessible by executing JavaScript [ECMA, 2011], Java or Flash code within the browser. This task gathers the same HTTP headers as Eckersley [2010] (User-Agent, User-Agent, Accept, Accept-Language, Accept-Encoding, and Accept-Charset). Such headers have been described in Section 2.2.2.2. In addition, this task collects the values for the additional HTTP headers described next. X-Real-IP header. This non-standard header identifies the IP address [Postel, 1981] of the user’s device. The Nginx reverse proxy [Reese, 2008], which is used in our implementation, adds this header. This reverse proxy receives every message sent from the web browser, and redirects it to the tracking server, which processes and persists the activity record. X-Forwarded-For header. This header is a multivalued attribute that includes the IP address of the web browser machine, as well as the IP addresses of the successive proxy servers that have routed the HTTP message [Reese, 2008]. The Nginx proxy also adds this header. Cache-Control header. 
This header is used to specify directives that must be obeyed by all caching mechanisms along the HTTP request/response chain. Unlike the approach followed by Eckersley [2010], our work does not make use of the Cookies Enabled attribute. The rest of the attributes (Plugins, Fonts, Video, Time Zone, Session Storage, Local Storage, and IE Persistence) have been collected by using a technique implemented by Eckersley [2010], which consists on the execution of a combination of JavaScript, Java and Flash code. To obtain the Plugins attribute it is necessary to distinguish the user browsers, since this conditions the way in which this information is accessed. 119 • In the case of the Mozilla Firefox65 , Google Chrome66 , Apple Safari67 , and Opera68 browsers, this attribute is obtained through the DOM (Document Object Model) by accessing to the navigator.plugins element. Such element contains an array of objects and each object contains the name, the description, and the version of a plugin. Listing 7.1 shows the JavaScript code for obtaining the Plugins attribute for these browsers. • In the case of Internet Explorer a different technique is applied because most versions of this browser do not include plugin information in its DOM. Such technique relies on the PluginDetect JavaScript library69 that receives a lost of the plugins for being detected and returns the information related to these plugins. Specifically, we have obtained information for the following plugins: Java, QuickTime70 , DevalVR71 , Shockwave72 , Flash, Windows Media Player73 , Silverlight74 , and Acrobat75 . The Fonts attribute is obtained through a Flash component. Therefore it cannot be obtained if Flash is not installed in user’s device. To extract the fonts information from the Flash component we make use of the jQuery Flash library76 , which allows querying Flash objects from JavaScript. Listing 7.2 shows the JavaScript code for obtaining the Fonts attribute. To extract video information we access the screen object included in the browsers’ DOM. Specifically, we obtain the values for the following attributes: • The attribute height, which contains the number of vertical pixels in the device’s screen. 65 http://www.mozilla.org/firefox http://www.google.es/chrome/browser 67 http://www.apple.com/safari 68 http://www.opera.com 69 http://www.pinlady.net/PluginDetect 70 http://www.apple.com/quicktime 71 http://www.devalvr.com 72 http://www.adobe.com/shockwave 73 http://windows.microsoft.com/en-us/windows/windows-media-player 74 http://www.microsoft.com/silverlight 75 http://www.adobe.com/products/acrobat.html 76 http://jquery.lukelutman.com/plugins/flash 66 120 1 2 var plugins = navigator.plugins; var plist = new Array(); 3 4 5 6 7 for (var i = 0; i < plugins.length; i++) { plist [ i ] = plugins[i ]. name + ”; ”; plist [ i ] += plugins[i]. description + ”; ”; plist [ i ] += plugins[i].filename + ”;”; 8 for (var n = 0; n < plugins[i ]. length; n++) plist [ i ] += ” (” + plugins[i][n ]. description + ”; ” + plugins [ i ][ n ]. type + ”; ” + plugins[i ][ n ]. suffixes + ”)”; 9 10 11 12 plist [ i ] += ”. ”; 13 14 } 15 16 plist . 
sort () ; Listing 7.1: Script for obtaining the Plugins attribute 1 2 var fonts = ””; var obj = document.getElementById(”flashfontshelper”); 3 4 5 6 7 8 if (obj && typeof(obj.GetVariable) != ”undefined”) { fonts = obj.GetVariable(”/:user fonts”); fonts = fonts.replace (/,/g,”, ”); fonts += ” (via Flash)”; } 9 10 11 if (fonts == ””) fonts = ”No Flash fonts detected”; Listing 7.2: Script for obtaining the Fonts attribute • The attribute width, which contains the number of horizontal pixels in the device’s screen. • The attribute colorDepth, which contains information about the number of colours supported by user’s device. Listing 7.3 shows the JavaScript code for obtaining the Video attribute. 121 1 video = screen.width + ”x” + screen.height + ”x” + screen.colorDepth; Listing 7.3: Script for obtaining the Video attribute 1 timezone = (new Date()).getTimezoneOffset(); Listing 7.4: Script for obtaining the Time Zone attribute 1 2 sessionStorage . fingerprint = ”yes”; sessionStorageCapability = (sessionStorage. fingerprint == ”yes”) Listing 7.5: Script for obtaining the Session Storage attribute The Time Zone attribute is obtained, as in previous cases, by using JavaScript code. To do so, an instance of the object Date is created and the property timezoneOffset is queried. Such property returns the offset in minutes of the local time zone with respect to UTC (Coordinated Universal Time). Listing 7.4 shows the JavaScript code for obtaining the Time Zone attribute. The technique for obtaining the Session Storage and Data Storage attributes consists in finding out whether the browser allows storing session or local data. To do so, the objects sessionStorage and localStorage are used. Listings 7.5 and 7.6 show the JavaScript code for obtaining these attributes. The process followed by both scripts is the following: 1. Firstly, we try to store a value in the object sessionStorage (or localStorage) for the fingerprint keyword (line 1). 2. Next, we query the value for the fingerprint keyword stored in the object sessionStorage (or localStorage) (line 2). (a) If the value obtained is equal to the assigned in step 1, then the browser is able to store session (or local) data. (b) Otherwise, the browser is not able to do so. The technique for obtaining the IE Persistence attribute consists in finding whether the browser lets modifying XML DOM elements. Listing 7.7 shows the 122 1 2 localStorage . fingerprint = ”yes”; localStorageCapability = (localStorage. fingerprint == ”yes”) Listing 7.6: Script for obtaining the Local Storage attribute 1 2 3 oDiv.setAttribute(” fingerprint ”, ”yes”); oDiv.save(”oXMLStore”); ieStorageCapability = (oDiv.getAttribute(”fingerprint”)) == ”yes”) Listing 7.7: Script for obtaining the IE Persistence attribute JavaScript code for obtaining this attributes. The process followed by this script is the following: 1. Firstly, we try to store a value in a div object for an attribute called fingerprint (line 1). 2. Next, the div object is stored within the browser’s cache (line 2). 3. After that, we query the value for the fingerprint keyword stored in the browser’s cache (line 3). (a) If the value obtained is equal to the assigned in Step 1, then the browser is able to store data within Internet Explorer cache. (b) Otherwise, the browser is not able to do so. In the experiment conducted in this thesis, we have collected the data used for the experiment by using a web tracking server based on cookies, generating records containing fingerprint attributes and a user identifier. 
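Putting the previous scripts together, the following is a minimal sketch, not the exact tracker used in the experiment, of how the collected attributes could be assembled into a single fingerprint record and sent to the tracking server as parameters of an HTTP GET request. The /track endpoint and the parameter names are illustrative assumptions, and the plist and fonts variables are assumed to have been computed by Listings 7.1 and 7.2.

// Sketch: assemble the client-side fingerprint attributes and send them
// to a tracking server via an HTTP GET request. The endpoint and the
// parameter names are illustrative assumptions.
function collectFingerprint() {
  // Attributes available directly from the DOM (Listings 7.3 to 7.6).
  var video = screen.width + 'x' + screen.height + 'x' + screen.colorDepth;
  var timezone = (new Date()).getTimezoneOffset();
  var hasSessionStorage, hasLocalStorage;
  try {
    sessionStorage.fingerprint = 'yes';
    hasSessionStorage = (sessionStorage.fingerprint === 'yes');
  } catch (e) { hasSessionStorage = false; }
  try {
    localStorage.fingerprint = 'yes';
    hasLocalStorage = (localStorage.fingerprint === 'yes');
  } catch (e) { hasLocalStorage = false; }
  return {
    userAgent: navigator.userAgent, // also available server-side as an HTTP header
    plugins: typeof plist !== 'undefined' ? plist.join(' ') : '', // from Listing 7.1
    fonts: typeof fonts !== 'undefined' ? fonts : '',             // from Listing 7.2
    video: video,
    timeZone: timezone,
    sessionStorage: hasSessionStorage,
    localStorage: hasLocalStorage
  };
}

function sendFingerprint(fp) {
  var params = [];
  for (var key in fp) {
    if (fp.hasOwnProperty(key)) {
      params.push(encodeURIComponent(key) + '=' + encodeURIComponent(String(fp[key])));
    }
  }
  // A 1x1 image request issues the GET without needing XMLHttpRequest.
  (new Image()).src = 'http://tracker.example.com/track?' + params.join('&');
}

sendFingerprint(collectFingerprint());

The HTTP headers of the fingerprint (Accept*, X-Real-IP, X-Forwarded-For, Cache-Control) are captured server-side by the reverse proxy and the tracking server when this request is received.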
These records have been obtained using the JavaScript tags technique for capturing web activity explained in Section 2.2.1.3, combined with the technique based on cookies for identifying users explained in Section 2.2.2.1. 7.1.2 Describe Data Task Regarding data format, the dataset used has the structure shown in Figure 7.1, which reflects the ontology elements of the Social Graph Ontology that the tech- 123 cd Unique User Identification Data Format sgo:Fingerprint sgo:xRealIP sgo:xForwardedFor sgo:userAgent sgo:accept sgo:acceptLanguage sgo:acceptCharset sgo:acceptEncoging sgo:cacheControl sgo:plugins sgo:fonts sgo:video sgo:timeZone sgo:sessionStorage sgo:localStorage sgo:iePersistence foaf:Agent 1 sgo:hasActivity * sgo:hasFingerprint 0..1 * sgo:Activity dcterms:created * sgo:hasCookie sgo:Cookie 0..1 dcterms:created Figure 7.1: Format of the data used by the technique for unique user identification based on evolving device fingerprint detection nique reads or writes, hiding those properties not required by the technique. The data format consists in a set of activity records captured by the tracking server. Each activity record corresponds to a single user and is related with a fingerprint and a cookie that uniquely identifies a given user. The classes and properties included in the diagram have been already described in Section 5.6. With respect to data quantity, the data used in the experiment conducted in this thesis consists in a set of 18,391 records extracted from a website, between September 28 and October 19, 2011. 7.1.3 Explore Data Task This task characterises the data from different viewpoints to ensure that the dataset is rich enough for model training. Specifically the objective of this task is to describe the distribution of the data with respect to unique visitors, web browsers used and countries of origin of the activity collected, and to study the characteristics of the fingerprint attributes from an Information Theory [Shannon and Warren, 1949] perspective. Next, we characterise the data used in our experiment according to the previous guidelines. During the period of the study, 10,834 unique visitors visited the website from 124 www.pocketinvaders.com 28 Sep 2011 - 19 Oct 2011 Visitors Overview Visitors 1,000 1,000 500 500 0 0 3 Oct 10,501 people visited www.pocketinvaders.com 10 Oct 17 Oct this site 2011of- 19 Oct 2011 Figure 7.2: Daily distribution of visitors during28 theSep period study Traffic Sources Overview 11,932 Visits Visits 1,000 1,000 10,501 Absolute Unique Visitors 500 500 18,425 Pageviews 0 0 1.54 Average Pageviews 3 Oct 10 Oct 17 Oct All traffic sources sent a total of 11,932 visits www.pocketinvaders.com 00:01:00 Time on Site 28 Sep 2011 19 Oct 2011 Figure 7.3: Daily distribution of visits during the period of- study Content Overview 8.40% Direct Traffic Search Engines 78.58% Bounce Rate 2,000 1,000 Pageviews Referring Sites 15.61% Referring Sites 83.74% New Visits Direct Traffic 75.96% Search Engines 2,000 1,000 Other 0 0 Technical Profile 3 Oct 10 Oct Top Traffic Sources Browser Pages on this site were viewed a total of 18,425 times 17 Oct Visits % visits Figure 7.4: Daily distribution of pageKeywords views during the periodVisits of study% visits Sources Visits % visits 18,425 Pageviews which we15,400 have Unique extracted the data, distributed daily as shown in Figure 7.2. Such Views visitors may include humans and web crawlers. These users made a total of 11,932 Bounce Rate site visits78.59% distributed daily as shown in Figure 7.3. 
The visitors registered a total of 18,391 web page views distributed daily as shown in Figure 7.4. Each web Top Content page view generates a record within the fingerprint log used. Pageviews On average, each visitor viewed 1.7 pages, remaining about one minute% Pageviews average time on the website. The bounce rate (i.e. percentage of visitors leaving the site after viewing a single web page) was 79%, while the percentage of new visitors was 84%, so there is a percentage of about 16% of users who visited the website before beginning the study. The minimum number of web pages viewed by a single visitor was 1, while the maximum was 389. Figure 7.5 shows the distribution of web pages viewed by single user. Table 7.1 shows the summary statistics relating to the distribution of the number of records captured by a single user. It includes Pages 125 Figure 7.5: Distribution of the activity records captured by unique user Statistic Count Mean Standard Deviation Coefficient of Variation Minimum Maximum Range Systematic Error Kurtosis Value 10,834 1.7 5.34 314.5% 1 389 388 1,921.62 60,503.1 Table 7.1: Statistics associated to the number of records gathered per unique user measures of tendency, variability and shape. With respect to web browsers, there is a representation of the most used browsers in the sample (39% of the activity was generated by Google Chrome, 30% by Mozilla Firefox, 18% by Internet Explorer, 6% by Android77 , 3% by Apple Safari, and 4% by other non-catalogued browsers). This distribution affects the diversity of values for different attributes, such as the User-Agent header, or the plugins installed. The sample used in our experiment contains activity generated in 63 different countries, as reflected in Figure 7.6. 77 http://www.android.com 126 www.pocketinvaders.com 28 Sep 2011 - 19 Oct 2011 Map Overlay Visits 1 5,277 11,932 visits cameFigure from 63 countries/territories 7.6: Distribution of visits per country Visits Table 11,932 Pages/Visit Avg. Time on Site % New Visits Bounce Rate 7.2 shows by the83.80% users of the 1078.58% countries with 1.54the activity generated 00:01:00 100.00% 1.54 from which 00:01:00 83.74% 78.58%shows the more visits to the site data has been extracted. The table Visits number of visits per visitor, the average page views per visit, the average time 5,277 percentage of new pages viewed per visitor, spent on the website per visitor, the 2,173 and the bounce rate. 904 The distribution of countries affects different fingerprint attributes, such as 874 the time zone and the Accept-Language header. 578 Table 7.3 shows the entropy [Shannon, 1948] of each fingerprint attribute. 360 The first column shows the variable 325name assigned to the attribute. The second column shows the attribute itself.173The third column indicates the entropy of 151 the attribute in our dataset. The fourth column shows the entropy obtained by 130 Eckersley [2010] for the same attributes. The entropy associated with headers X-Real-IP, X-Forwarded-For and Cache-Control was not studied by Eckersley [2010], while the attribute that indicates whether the browser supports cookies has not been used in our work. 
On the other hand, the entropy associated with Accept HTTP headers was studied jointly by Eckersley [2010], whereas in our work it has 127 Country Spain Mexico Argentina Chile Colombia Venezuela Peru Unknown USA Ecuador Visits 5277 2173 904 874 578 360 325 173 151 130 Pages 1.73 1.34 1.41 1.49 1.40 1.66 1.37 1.51 1.42 1.38 Time Spent 1 20 41 39 53 45 59 48 1 47 45 32 New Visits 75.67% 90.43% 87.94% 92.11% 89.79% 90.28% 90.46% 91.91% 93.38% 93.08% Bounce Rate 75, 52% 84, 49% 79, 98% 79, 41% 80, 28% 70, 28% 80, 62% 82, 08% 78, 15% 81, 54% Table 7.2: Distribution of visits for the 10 countries that generated more site activity X X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 −− Attribute X-Real-IP X-Forwarded-For User-Agent Accept Accept-Language Accept-Charset Accept-Encoding Cache-Control Plugins Fonts Video Time zone Session storage Local storage IE persistence Cookies enabled H(X) 12,5061 12,52 7,51458 2,05302 3,68173 1,89086 1,81318 0,299063 11,7677 8,38331 5,50273 2,30895 0,299995 0,297941 0,560692 – H(X) [Eckersley, 2010] – – 10 6,09 – 15,4 13,9 4,83 3,04 2,12 0,353 Table 7.3: Entropy of fingerprint attributes been studied separately; the same happens with browser storage capabilities. The quantitative differences between the entropy values of obtained in our work and those obtained by Eckersley [2010] are due the size of the datasets; the longer dataset of Eckersley [2010] contains data from 470,161 browsers, whereas our dataset contains 10,834. 128 e Lo ca IE p ersis tenc Sessi l st o rage ge Tim o n st ora Vide X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 – 12,6 12,7 12,6 12,6 12,6 12,5 12,7 12,7 12,6 12,5 12,5 12,5 12,5 – 7,8 8,3 7,5 7,6 7,7 12,3 11,1 9,7 9 7,5 7,5 7,5 – 4,2 2,6 3 2,3 12 9,3 6,3 4,2 2,3 2,3 2,9 – 3,8 4,2 3,9 12 9,8 7,6 5,4 3,8 3,8 3,9 – 2,4 2,2 11,8 9,2 6,3 4,1 2,1 2,1 2,2 – 2,1 11,9 9,2 6,4 4,1 2 2 2,2 – 11,8 8,5 5,7 2,6 0,6 0,6 0,9 – 12 12,3 12 11,8 11,8 11,8 – 10,7 9,4 8,5 8,5 8,5 – 7,4 5,7 5,7 5,9 – 2,6 2,6 2,9 – 0,3 0,8 – 0,8 – o Font s X2 ins Plug e zon e ntrol Cach e-Co ing nco d Acce pt-E et hars Acce pt-C angu X1 – 12,5 12,6 12,6 12,6 12,6 12,6 12,5 12,7 12,7 12,6 12,5 12,5 12,5 12,5 pt Acce pt-L nt Acce -Age ded- User rwar X-Fo H(X, Y ) X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X-Re al-IP For age Table 7.4 shows the cross-entropy values when pairs of fingerprint attributes are combined. The pairs of fingerprint attributes with more discriminative power are the X-Real-IP header combined with the plugins or the font, and the XForwarded-For header combined with the plugins, the fonts, or the Accept header. The cross-entropy values of X-Real-IP and X-Forwarded-For are quite similar because the value of the former is always included within the value of the latter, and most of fingerprint records do not correspond to a proxy route. 
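As an illustration of how the figures in Tables 7.3 and 7.4 can be computed, the following is a minimal sketch, assuming that the fingerprint records are available as an array of JavaScript objects keyed by attribute name, of the Shannon entropy of a single attribute, the joint entropy of a pair of attributes, and the conditional entropy discussed below.

// Sketch: estimate H(X), H(X,Y) and H(X|Y) from a set of fingerprint records.
// Each record is assumed to be a plain object, e.g.
// { userAgent: '...', plugins: '...', timeZone: '-120', ... }.

// Entropy (in bits) of the empirical distribution of attribute `attr`.
function entropy(records, attr) {
  var counts = {};
  records.forEach(function (r) {
    var v = String(r[attr]);
    counts[v] = (counts[v] || 0) + 1;
  });
  var h = 0;
  for (var v in counts) {
    var p = counts[v] / records.length;
    h -= p * Math.log(p) / Math.LN2;
  }
  return h;
}

// Joint entropy H(X,Y): entropy of the pair of values, obtained by
// combining both attribute values into a single key.
function jointEntropy(records, attrX, attrY) {
  var paired = records.map(function (r) {
    return { pair: String(r[attrX]) + '\u0000' + String(r[attrY]) };
  });
  return entropy(paired, 'pair');
}

// Conditional entropy H(X|Y) = H(X,Y) - H(Y).
function conditionalEntropy(records, attrX, attrY) {
  return jointEntropy(records, attrX, attrY) - entropy(records, attrY);
}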
Table 7.4: Cross-entropy between pairs of fingerprint attributes 129 e Lo ca IE p ersis tenc Sessi l st o ra ge e Tim o n st orag Vide e zon e Font s X3 5,13 5,13 – 0,26 0,82 0,03 0,13 0,15 4,83 3,58 2,23 1,47 0,01 0,01 0,01 X4 10,59 10,60 5,73 – 2,2 0,57 0,94 0,28 9,94 7,24 4,24 2,19 0,23 0,22 0,44 X5 8,88 8,89 4,65 0,57 – 0,13 0,49 0,26 8,33 6,14 3,9 1,76 0,16 0,15 0,19 X6 10,66 10,67 5,65 0,74 1,92 – 0,55 0,28 9,96 7,31 4,44 2,23 0,22 0,22 0,26 X7 10,74 10,75 5,83 1,18 2,36 0,63 – 0,27 10,07 7,36 4,63 2,24 0,23 0,23 0,41 X8 12,22 12,23 7,36 2,04 3,64 1,88 1,78 – 11,51 8,21 5,44 2,29 0,3 0,29 0,56 X9 0,91 0,91 0,58 0,23 0,24 0,08 0,12 0,04 – 0,22 0,53 0,27 0,05 0,05 0,01 X10 4,3 4,3 2,71 0,91 1,44 0,82 0,79 0,13 3,6 – 2,28 1,02 0,15 0,14 0,13 X11 7,13 7,14 4,24 0,79 2,08 0,83 0,94 0,24 6,79 5,16 – 1,92 0,16 0,16 0,39 X12 10,21 10,22 6,68 1,94 3,13 1,81 1,75 0,28 9,73 7,09 5,12 – 0,27 0,27 0,55 X13 12,21 12,23 7,22 1,98 3,54 1,82 1,75 0,3 11,52 8,23 5,36 2,28 – 0,01 0,54 X14 12,21 12,23 7,22 1,98 3,54 1,81 1,75 0,3 11,52 8,23 5,36 2,28 0,01 – 0,54 X15 11,95 11,97 6,96 1,93 3,31 1,59 1,67 0,3 11,21 7,95 5,33 2,3 0,28 0,28 – o Plug X2 0 – 0,13 0,13 0,05 0,04 0,04 0,01 0,16 0,17 0,12 0,01 0,01 0,01 0,01 ins Cach e-Co ntrol ing nco d Acce pt-E hars et age Acce pt-C Acce pt-L Acce X1 – 0,01 0,13 0,13 0,05 0,05 0,05 0,01 0,17 0,17 0,13 0,01 0,01 0,01 0,01 pt User -Age nt angu For dedrwar X-Fo al-IP X-Re H(X|Y ) X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 Table 7.5: Conditional entropy between pairs of fingerprint attributes Finally, Table 7.5 shows the entropy of every fingerprint attribute when the value of another attribute is known (i.e. conditional entropy). The columns in the table correspond to the attribute Y known, while the rows indicate the attribute X whose entropy we want to know, given a known value of Y . As it can be seen in the table, there is not uncertainty for the attribute X-Real-IP when the header X-Forwarded-For is known. This is because the value of the former is always included in the value of the latter. In addition, many fingerprint attributes provide a few information over others (e.g. there is not much uncertainty for the time zone attribute when the value of the header X-Forwarder-For is known). 7.1.4 Verify Data Quality Task The study of the dataset used in our experiment shows that the data is assorted enough to perform model training, from the point of view of records per unique user (from 1 to 389), web browsers and countries of origin. In addition, as shown with the study of the entropy, any variable is not enough by itself for determining unique users, neither any combination of variables. Therefore, the dataset will be useful for stressing the model in order to demonstrate its classification power. 130 7.2 Data Preparation Activity This activity consists in the ordered execution of the following tasks: 1. The Select Data task consists in deciding the data to be used for the analysis, removing from the dataset the fingerprint records that may conduct to deficiencies in the model resulting from the learning phase. This task is described in Section 7.2.1. 2. The Clean Data task consists in cleansing the dataset in order to ensure that it contains activity records corresponding to human agents uniquely identified. This task is described in Section 7.2.2. 3. The Construct Data task consists in performing data transformations to the values of some of the fingerprint attributes gathered. This task is described in Section 7.2.3. 
7.2.1 Select Data Task As the goal of this technique is to uniquely identify users from web activity records, the records used for model learning must contain the activity of users uniquely identified, i.e. the dataset must not contain activity records assigned to multiple identifiers that correspond to the same user. Additionally, the dataset must not contain non-human activity (i.e. records generated by robots). Therefore, the activity corresponding to users with multiple identifiers and the activity generated by non-human agents must be removed. This is performed in the task described next. 7.2.2 Clean Data Task This task cleans the dataset in order to satisfy the selection criteria identified in the previous section. As the users in the dataset used in the experiment conducted in this thesis have been collected using the technique based on cookies, users may have been identified more than once, due to the problems identified in Section 2.2.2.1. To 131 Search Engine Google Bing Yahoo! User-Agent Mozilla/5.0 (compatible; Googlebot/2.1; + http://www.google.com/bot.html) Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) Table 7.6: User-Agent values for Google, Bing, and Yahoo! robots deal with this issue we have only taken into account the users identified before the data-gathering period by removing the activity records of those users that were firstly identified after the initial gathering date. For doing so, all the activity records that are related to a cookie that has been created after the initial date have been removed. This data cleansing action allows to evaluate the performance of the technique with respect to a gold standard based on cookies that do not include multiple identifiers for single users. Finally, since the technique is focused on users, this task filters the activity generated by web crawlers. To do this, it discards the activity records with a User-Agent header whose value is recognised as a robot. For example, 7.6 show the values corresponding to Google, Bing, and Yahoo! robots. Since not all robots are identified through the User-Agent header, this technique implements an additional mechanism that consists in filtering the activity produced by agents that perform more than 3 requests every 0.5 seconds. To determine that two records are from the same agent, such records must be identical. In the experiment conducted in this thesis, by using this method, we have filtered 73 records produced by crawlers. 7.2.3 Construct Data Task For each fingerprint record, this task stores the attribute values within a database, according to the format explained in Section 7.1.2. We apply a compression function to the values of the Plugins and Fonts attributes, so they can be included within the parameters of the HTTP GET requests [Fielding and Reschke, 2014b] that are sent from the browser to the tracking server, as the data obtained from such attributes can be extensive. The compression function used by this technique is the cryptographic hash function SHA-1 [Eastlake and Jones, 2001]. Similar oneway functions could be applied to other attributes for avoiding persisting personal data (e.g. IP addresses, time zones, etc.), thus warranting users’ privacy. 132 7.3 Modelling Activity This activity consists in the ordered execution of the following tasks: 1. 
The Select Modelling Technique task consists in selecting and describing a modelling technique for begin applied for unique user identification purposes. This task is described in Section 7.3.1. 2. The Generate Test Design task consists in defining the approach followed for evaluating the technique. This task is described in Section 7.3.2. 3. The Build Model task consists in learning the model used for identifying unique users. This task is described in Section 7.3.3. Next, each of these tasks are described. 7.3.1 Select Modelling Technique Task This section describes the classification approach (i.e. the modelling technique) used for unique user identification. We have adapted the early binding algorithm introduced in Section 2.3.1. The input of this algorithm is a sequence of fingerprints R ordered by timestamp ascending. The output of the algorithm is a set of clusters C, in which each cluster C ∈ C, includes fingerprints in R identified as belonging to the same browser. Listing 7.8 formalises the algorithm proposed. The steps executed in the algorithm are explained next. 1. Firstly, we initialise the set of clusters C at the empty set (line 3). 2. Next, for each fingerprint ri we calculate the maximum similarity between such fingerprint and each cluster Cj generated so far (line 5). Similarity computation between clusters and fingerprint is explained in Section 7.3.1.2. (a) If the maximum similarity is greater or equal than a threshold θ, then there exists a cluster C to which we can add the fingerprint ri that is been processed, so we execute the following steps (lines 6-8): i. Obtain the cluster that is more similar to the fingerprint (line 6). 133 ii. Add the fingerprint to such cluster (line 7). iii. Update cluster signature (line 8). Such signature is used to compare candidate fingerprints with the cluster. Section 7.3.1.1 describes the steps that must be followed for updating the signature. (b) If the maximum similarity is less than the threshold θ, then there does not exist a cluster C to which we can add the fingerprint, so we execute the following steps (lines 10-12): i. Create a new cluster C and add the fingerprint ri to it (line 10). ii. Add the cluster C to the set of clusters C (line 11). iii. Generate a new signature for cluster C (line 12). Section 7.3.1.1 describes the steps that must be followed for creating the signature. 3. Finally, the set of clusters C is returned (line 15). 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 function ClusterF ingerprints(R) begin C⇐∅ for each ri ∈ R do if maxCj ∈C sim(ri , Cj ) ≥ θ then C ⇐ arg maxCj ∈C sim(ri , Cj ) C ⇐ C ∪ {ri } U pdateSignature(ri , C) else C ⇐ {ri } C ⇐ C ∪ {C} CreateSignature(ri , C) end if end for return C end Listing 7.8: Algorithm for clustering fingerprints of the same browser 134 7.3.1.1 Cluster Signature The signature of a cluster allows obtaining similarities between the clusters and candidate fingerprints for being included in such clusters. Such signature is a tuple (V, Te , Tl ), in which: • V = (C.X1 , ..., C.Xi , ..., C.X15 ) is a sequence, in which each component corresponds with the value observed for the attribute Xi ∈ X in the last fingerprint added to the cluster C, where X is the set of attributes. • Te = (te (C.X1 ), ..., te (C.Xi ), ..., te (C.X15 )) is a sequence, in which each component corresponds with the timestamp of the first observation of the value C.Xi for the attribute Xi within a fingerprint added to the cluster C. 
• Tl = (tl (C.X1 ), ..., tl (C.Xi ), ..., tl (C.X15 )) is a sequence, in which each component corresponds with the timestamp of the last observation of the value C.Xi for the attribute Xi within a fingerprint added to the cluster C. Next, the operations for creating and updating clusters signatures are described. Signature creation. Listing 7.9 details the operation for creating a cluster signature. The inputs of this operation are the fingerprint r and the cluster C, whose signature will be created from r. When this operation is executed, the cluster C contains only the fingerprint r. Thus the first time that the value of an attribute Xi is observed for the cluster C (i.e. te (C.Xi )) corresponds to the timestamp of fingerprint creation r.t, as happens with the last time that the value of an attribute Xi is observed for the cluster C (i.e. tl (C.Xi )). Signature updating. Listing 7.10 details the operation for updating a cluster signature. The inputs of this operation are the fingerprint r and the cluster C, whose signature we want to update from r. In this operation, for each fingerprint attribute Xi we execute the following steps (lines 4-13): 1. Compute the similarity between fingerprint attribute value r.Xi and cluster attribute value C.Xi . The similarity computation between attributes is defined in Section 7.3.1.2. 135 1 2 3 4 5 6 7 8 procedure CreateSignature(r, C) begin for each Xi ∈ X do te (C.Xi ) ⇐ r.t tl (C.Xi ) ⇐ r.t C.Xi ⇐ r.Xi end for end Listing 7.9: Operation for creating a cluster signature (a) If the similarity s is less than a threshold θl , we consider that the value of the attribute has changed. Thus, we assign the timestamp of fingerprint creation r.t to te (C.Xi ) (lines 5-6). We have considered θl = 0.5 in our experiment. (b) If the similarity s is greater than a threshold θh , we consider that the value of the attribute has not changed. Thus, we maintain the value of te (C.Xi ) (lines 7-8). We have considered θh = 0.9 in our experiment. (c) If θl ≤ s ≤ θh , we consider that the attribute maintains its value with probability s, so it changes its value with probability 1 − s. Thus we estimate the instant of time in which the attribute changed its value by combining the timestamp of current attribute value, with fingerprint creation timestamp as shown in line 10. 2. Assign the fingerprint creation timestamp r.t to tl (C.Xi ) (line 12). 3. Assign attribute value Xi of fingerprint r (i.e. r.Xi ) to the cluster signature for the attribute Xi (i.e. C.Xi ) (line 13). 136 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 procedure U pdateSignature(r, C) begin for each Xi ∈ X do s ⇐ sim(r.Xi , C.Xi ) if s < θl then te (C.Xi ) ⇐ r.t else if s > θh then (∗ The previous value of te (C.Xi ) is maintained ∗) else te (C.Xi ) ⇐ s · te (C.Xi ) + (1 − s) · r.t end if tl (C.Xi ) ⇐ r.t C.Xi ⇐ r.Xi end for end Listing 7.10: Operation for updating a cluster signature 7.3.1.2 Similarity Computation The similarity between a fingerprint r and a cluster C is calculated as the weighted average of the similarities between the values of each of the attributes in the fingerprint and the values of the same attributes in the signature of the cluster (see Equation 7.1). Section 7.3.1.3 explains the different alternatives for obtaining the weights wX .  sim(r, C) = X∈X sim(r.X, C.X) · wX  X∈X wX (7.1) With respect to similarity between fingerprint attribute values and cluster signature attribute values, the similarity measure considered for the most part of the attributes is the equality (see Equation 7.2).  
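The following is a minimal sketch of the similarity of Equation 7.1, assuming that a fingerprint and a cluster signature are represented as objects keyed by attribute name and that the weights wX are supplied by one of the variants of Section 7.3.1.3. For brevity it applies the exact-match measure of Equation 7.2 to every attribute; the refined measures for the X-Forwarded-For and User-Agent headers are introduced next.

// Sketch of Equation 7.1: weighted average of per-attribute similarities
// between a fingerprint r and the signature of a cluster C.
// `weights` maps each attribute name to its weight w_X (uniform, entropy,
// decay-based, or combined, depending on the variant).
function attributeSimilarity(attr, rValue, cValue) {
  // Equation 7.2 (exact match). The Jaccard and normalised Levenshtein
  // measures used for X-Forwarded-For and User-Agent would replace this
  // for those two attributes.
  return rValue === cValue ? 1 : 0;
}

function fingerprintSimilarity(r, clusterSignature, weights) {
  var num = 0;
  var den = 0;
  for (var attr in weights) {
    if (weights.hasOwnProperty(attr)) {
      var s = attributeSimilarity(attr, r[attr], clusterSignature[attr]);
      num += s * weights[attr];
      den += weights[attr];
    }
  }
  return den > 0 ? num / den : 0;
}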
sim(r.X, C.X) = 1 r.X = C.X 0 r.X = C.X (7.2) When the fingerprint attribute being compared is X2 (i.e. X-Forwarded-For header), we apply the index proposed by Jaccard [1901] for measuring the similar- 137 ity between sets (see Equation 7.3), as suggested by Li et al. [2011] for multivalued attributes. sim(r.X2 , C.X2 ) = Jaccard(r.X2 , C.X2 ) = r.X2 ∩ C.X2 r.X2 ∪ C.X2 (7.3) Finally, if the fingerprint attribute being compared is X3 (i.e. User-Agent header), we apply a similarity calculated from the normalized Levenshtein [1966] distance shown in Equation 7.4. Such distance is appropriate for the attribute X3 , because the value of the User-Agent header changes slightly over time, due to browser or operating system version updates. sim(r.X3 , C.X3 ) = 1 − Levenshtein(r.X3 , C.X3 ) max length(v) (7.4) v∈{r.X3 ,C.X3 } 7.3.1.3 Attribute Weight Computation The algorithm described above has been tested with four different variants. These alternatives consists in using different weights to ponder similarity computation between values of the fingerprint attributes and cluster signature attributes. Next, the four variants are described. Variant based on uniform weights. The first alternative is the most simple and consists in assigning the same weight for all fingerprint attributes, as shown by Equation 7.5. wX = 1 (7.5) Variant based on attribute entropy. The second variant consists in using the entropy of the attribute as the attribute weight, as shown by Equation 7.6. wX = H(X) (7.6) Variant based on time decay. The third variant takes into account attribute agreement and disagreement decays. Equation 7.7 shows how to calculate 138 attribute weight according to this variant for single-valued attributes (i.e. all attributes with the exception of X-Forwarded-For header). ⎧ ⎪ ⎨ 1 − d= (X, Δtl ) s > θh wX = 1 − d= (X, Δte ) s < θl ⎪ ⎩ = = 1 − s · d (X, Δtl ) − (1 − s) · d (X, Δte ) θl < s < θh (7.7) As defined by Li et al. [2011], given a similarity s = sim(r.X, C.X) between two values of an attribute, with probability s, the two values are the same and we shall use the complement of the agreement decay as attribute weight. On the other hand, with probability 1 − s, the values are different and we shall use the complement of the disagreement decay as attribute weight. Thus attribute weight is computed by combining the complements of agreement and disagreement decays. For high similarity values (i.e. s > θh = 0.9), we only use the complement of agreement decay, while for low similarity values (i.e. s > θl = 0.5), we only use the complement of disagreement decay. With respect to the time periods Δt used for computing disagreement decay, we take into account the time lapsed between fingerprint capturing r.t and the first time that the current attribute value was observed in the cluster, as shown in Equation 7.8. Δte = |r.t − te (C.X)| (7.8) On the other hand, for computing agreement decay, we take into account the time lapsed between r.t and the last time that the current attribute value was observed in the cluster, as shown in Equation 7.9. Δtl = |r.t − tl (C.X)| (7.9) Finally, for the X-Forwarded-For header, we only take into account agreement decay, as explained by Li et al. [2011], since such header is a multi- 139 valued attribute. Thus, in such case, we calculate he attribute weight as shown in Equation 7.10. wX2 = 1 − d= (X2 , Δtl ) (7.10) Variant based on attribute entropy and time decay. The last variant takes into account both attribute evolution and entropy. 
Therefore, the attribute weights are obtained by multiplying the weight obtained according to the previous variant by attribute entropy, as shown in Equation 7.11 for singlevalued attributes. ⎧ ⎪ ⎨ H(X) · (1 − d= (X, Δtl )) s > θh =  wX = H(X) · (1 − d (X, Δte )) s < θl ⎪ ⎩ H(X) · (1 − s · d= (X, Δtl ) − (1 − s) · d= (X, Δte )) θl < s < θh (7.11) Finally, for the X-Forwarded-For header we calculate the attribute weight as shown in Equation 7.12. wX2 = H(X2 ) · (1 − (d= (X2 , Δtl ))) 7.3.2 (7.12) Generate Test Design Task The test designed consists in performing a 2-fold cross-validation with the gold standard previously constructed. The gold standard consists in a corpus of activity records with users identified by using the technique based on cookies. We have ensured in the Clean Data task (see Section 7.2.2) that there is a unique cookie that identifies every single user. The evaluation results are discussed in Section 7.4. 7.3.3 Build Model Task This task consists in learning the model used for unique user identification. It consists in the following steps: 140 1. Obtain the entropy for each fingerprint attribute. 2. Obtain the evolution parameters (i.e. agreement decay and disagreement decay) for each fingerprint attribute. The result of applying Step 1 to the dataset has been shown in Table 7.3. The results of Step 2 are described next. We have implemented the algorithms described by Li et al. [2011] for learning agreement and disagreement decays. Once we have obtained the temporal values of these probabilities for each attribute, we have performed simple regression analyses, obtaining explanatory models for the agreement and disagreement decays. Each model corresponds to a function dp (X, Δt), where • p is the type of decay (d= (X, Δt) for disagreement decay and d= (X, Δt) for agreement decay), • X is the fingerprint attribute, and • Δt is a time increment, such that Δt ∈ [0, ∞). The time unit of measurement that we used in our experiment is the minute, although we maintain a precision of five fractional digits for time units because users activity timestamps are defined at the granularity of milliseconds. In addition, each function dp (X, Δt) complies with the properties defined by Li et al. [2011] for agreement and disagreement decays: • Any value of dp (X, Δt) is defined within the interval [0, 1]. • dp (X, Δt) is a monotonically increasing function. Tables 7.7 and 7.8 show agreement and disagreement decays respectively for the fingerprint attributes. The attributes with faster disagreement decays include the X-Real-IP and the User-Agent headers. In the case of the X-Real-IP header, IP addresses [Postel, 1981] use to change with DCHP (Dynamic Host Configuration Protocol) [Droms, 1997] assignments, mostly in mobile environments. 
In addition, browser versions 141 Attribute Disagreement decay ⎧ √ 0.0033855 + 0.00348067 Δt ⎨  √ −0.23349 + 0.00721289 Δt ⎩ 1 0 < Δt < 1047.895444 1047.895444 ≤ Δt < 29245.06883 Δt ≥ 29245.06883 X-Real-IP d= (X1 , Δt) = X-Forwarded-For N/A when⎧ the attribute is multivalued [Li et al., 2011] User-Agent d= (X3 , Δt) = Accept d= (X4 , Δt) = Accept-Language Accept-Charset Accept-Encoding Cache-Control d= (X ⎨ 0 −0.0047762 + 0.0000356876Δt 1 ⎩ 5 , Δt) d= (X6 , Δt) =   √ (0.279051 + 0.00387899 Δt)2 1 =  d= (X7 , Δt) = e−4.09968+3.23283·10 1  1 √ e−4.41392+0.0222535 Δt 1 √ (0.0835439 + 0.00570466 Δt)2 1 Plugins d= (X Fonts ⎧ (0.012879 + 0.0605308 ln Δt)2 ⎨  √ d= (X10 , Δt) = −0.230069 + 0.00692895 Δt ⎩ 1 Video d= (X11 , Δt) = 9 , Δt) =  Session storage Local storage IE persistence (0.233452 +  d= (X12 , Δt) = Time zone −9 Δt2 13 , Δt) Δt < 35610.94887 Δt ≥ 35610.94887  Δt ≥ 40342.93458 Δt < 39341.62217 Δt ≥ 39341.62217 Δt < 25808.56167 Δt ≥ 25808.56167 0 < Δt < 1102.506998 1102.506998 ≤ Δt < 31515.49207 √ Δt ≥ 31515.49207 0.00417302 Δt)2 Δt < 33742.54047 1 Δt ≥ 33742.54047 (0.12658 + 0.000024006Δt)2 1 √ e−5.91823+0.0303167 Δt 1 √  −6.04222+0.0306873 Δt e =  d (X14 , Δt) = 1 √  −6.29214+0.0331916 Δt e =  d (X15 , Δt) = 1 d= (X Δt < 34543.93180 Δt ≥ 34543.93180 (0.0281337 + 4.35781 · 10−10 Δt2 )2 Δt < 47224.69002 1 Δt ≥ 47224.69002  −4.67385+0.000115853Δt e Δt < 40342.93458 d= (X8 , Δt) =  0 < Δt < 133.8336005 133.8336005 ≤ Δt < 28154.77084 Δt ≥ 28154.77084 = Δt < 36383.40415 Δt ≥ 36383.40415 Δt < 38108.32197 Δt ≥ 38108.32197 Δt < 38768.20650 Δt ≥ 38768.20650 Δt < 35936.88071 Δt ≥ 35936.88071 Table 7.7: Disagreement decay of fingerprint attributes use to be updated frequently (Google Chrome updates itself automatically), what changes the value of the User-Agent header. The attributes with slower disagreement decays include the Accept* headers. These headers tend to be stable, since they specify attributes such as the user language or the expected character encoding. The agreement decay of most of the fingerprint attributes present a total linearity with fast agreement decays (the agreement decay is 1 before the 3rd minute). The attributes X-Real-IP and X-Forwarded-For grow even faster than the others, since the same IP address can be assigned unsing NAT (Network Address Translator) [Egevang, 1994] to different machines at the same time. 
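To illustrate how these learned models are applied at clustering time, the following minimal sketch encodes the disagreement decay of the X-Real-IP header (reported in Table 7.7 and later in Equation 7.13) as a piecewise function of the elapsed time Δt, expressed in minutes; the remaining decays of Tables 7.7 and 7.8 would be encoded analogously.

// Sketch: disagreement decay of the X-Real-IP header as a piecewise
// function of the elapsed time deltaT (in minutes). The coefficients are
// those reported in Table 7.7 / Equation 7.13.
function disagreementDecayXRealIP(deltaT) {
  if (deltaT < 1047.895444) {
    return 0.0033855 + 0.00348067 * Math.sqrt(deltaT);
  } else if (deltaT < 29245.06883) {
    return -0.23349 + 0.00721289 * Math.sqrt(deltaT);
  }
  return 1; // deltaT >= 29245.06883
}

// Example: estimated probability that a browser has changed its IP address
// roughly one day (1440 minutes) after the last observation of its value.
var p = disagreementDecayXRealIP(1440);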
142 Attribute X-Real-IP X-Forwarded-For User-Agent Agreement decay  −5.59227+0.550263 ln Δt e = d (X1 , Δt) = 1  −5.06924+0.498956 ln Δt e d= (X2 , Δt) = 1  −10.1715+0.993395 ln Δt e d= (X3 , Δt) = 1  Accept d= (X4 , Δt) = Accept-Language d= (X5 , Δt) = 0.000033494 + 0.000033083Δt Δt < 30225.99238 1 Δt ≥ 30225.99238  Accept-Charset Accept-Encoding Cache-Control 0.0000347399 + 0.0000330993Δt 1 0.0000339481 + 0.0000330917Δt = d (X6 , Δt) = 1  0.0000333693 + 0.0000330786Δt = d (X7 , Δt) = 1 d= (X8 , Δt) =     Fonts d= (X Video d= (X11 , Δt) = Time zone d= (X12 , Δt) = Session storage d= (X13 , Δt) = Local storage d= (X14 , Δt) = IE persistence d= (X15 , Δt) = 10 , Δt) = Δt < 30211.06972 Δt ≥ 30211.06972 Δt < 30218.03207 Δt ≥ 30218.03207 Δt < 30230.01671 Δt ≥ 30230.01671 −0.0112165 + 0.00005847Δt − 8.0021 · 10−10 Δt2 1 d= (X9 , Δt) = Plugins Δt < 25923.46755 Δt ≥ 25923.46755 Δt < 25840.37422 Δt ≥ 25840.37422 Δt < 27976.76015 Δt ≥ 27976.76015      e−10.1405+0.98999 ln Δt 1 Δt < 28104.69216 Δt ≥ 28104.69216 Δt < 28086.17546 Δt ≥ 28086.17546 0.0000332742 + 0.0000330753Δt 1 0.0000673283 + 0.0000331751Δt 1 0.0000337212 + 0.0000337212Δt 1 0.0000331289 + 0.0000330721Δt 1 0.0000331289 + 0.0000330721Δt 1 0.0000331737 + 0.0000330731Δt 1 Δt < 30233.03570 Δt ≥ 30233.03570 Δt < 30141.05976 Δt ≥ 30141.05976 Δt < 30222.05737 Δt ≥ 30222.05737 Δt < 30235.96539 Δt ≥ 30235.96539 Δt < 30235.96539 Δt ≥ 30235.96539 Δt < 30235.04982 Δt ≥ 30235.04982 Table 7.8: Agreement decay of fingerprint attributes Next, the agreement and disagreement decays are described for the fingerprint attributes used in this work. 7.3.3.1 X-Real-IP Header Figure 7.7 shows the values learned for the disagreement decay of the X-RealIP header in blue, while the regression model obtained is shown in green. The regression function is not defined for the interval [0, 1047.89544) as for values in this interval the radicand expression produces negative numbers. For such interval the model shown in Figure 7.8 has been obtained by performing and additional regression specifically for the interval [0, 1047.89544). Joining both models, the disagreement decay the X-Real-IP header is described by Equation 7.13. 143 = d (X1 , Δt) = ⎧ √ ⎪ 0.0033855 + 0.00348067 Δt ⎨ √ ⎪ ⎩ 0 < Δt < 1047.895444 −0.23349 + 0.00721289 Δt 1047.895444 ≤ Δt < 29245.06883 Δt ≥ 29245.06883 1 (7.13) Figure 7.9 shows the model learned for the agreement decay of the X-Real-IP header, which is described by Equation 7.14.  d= (X1 , Δt) = e−5.59227+0.550263 ln Δt Δt < 25923.46755 1 Δt ≥ 25923.46755 (7.14) Figure 7.7: Disagreement decay for the X-Real-IP header (second interval) Figure 7.8: Disagreement decay for the X-Real-IP header (first interval) 144 Figure 7.9: Agreement decay for the X-Real-IP header 7.3.3.2 X-Forwarded-For Header The X-Forwarded-For header is a multivalued attribute (i.e. it contains multiple IP addresses) differing from the rest of fingerprint attributes, which are singlevalued (i.e. only contain one value per attribute). As stated by Li et al. [2011], for multivalued attributes only agreement decay must be learned due to the following reasons: (i) having different values for such attributes does not indicate record un-match, and (ii) sharing the same value for such attributes is additional evidence for record match. Therefore, for the X-Forwarded-For header we have only learned its agreement decay. Figure 7.10 shows the model learned, which is described by Equation 7.15. 
Figure 7.10: Agreement decay for the X-Forwarded-For header 145  = d (X2 , Δt) = e−5.06924+0.498956 ln Δt Δt < 25840.37422 1 Δt ≥ 25840.37422 (7.15) As it can be observed, the model is quite similar to the one corresponding to the agreement decay for the X-Real-IP header, because the value of the X-RealIP header is always included within the values of the X-Forwarded-For header, and, in most cases, the X-Forwarded-For header includes a unique value that corresponds to the value of the X-Real-IP header. In addition, as shown in Table 7.5, the values of their conditioned entropy are very low: H(X1 |X2 ) = 0 and H(X2 |X1 ) = 0.01. 7.3.3.3 User-Agent Header Figure 7.11 shows the model learned for the disagreement decay for the UserAgent header. As it can be seen, this header changes in a lineal fashion, slower than the X-Real-IP header. Therefore it is a more stable fingerprint attribute. Equation 7.16 describes the disagreement decay of the User-Agent header. = ⎧ ⎪ ⎨ d (X3 , Δt) = ⎪ ⎩ 0 0 < Δt < 133.8336005 −0.0047762 + 0.0000356876Δt 133.8336005 ≤ Δt < 28154.77084 Δt ≥ 28154.77084 1 (7.16) Figure 7.11: Disagreement decay for the User-Agent header 146 Figure 7.12: Agreement decay for the User-Agent header Figure 7.12 shows the model learned for the agreement decay of the UserAgent header, which is described by Equation 7.17.  d= (X3 , Δt) = 7.3.3.4 e−10.1715+0.993395 ln Δt Δt < 27976.76015 1 Δt ≥ 27976.76015 (7.17) Accept Header Figure 7.13 shows the model learned for the disagreement decay of the Accept header, which is described by Equation 7.18.  d= (X4 , Δt) = √ (0.279051 + 0.00387899 Δt)2 Δt < 34543.93180 1 Δt ≥ 34543.93180 (7.18) Figure 7.14 shows the model learned for the agreement decay of the Accept header, which is described by Equation 7.19.  d= (X4 , Δt) = 0.000033494 + 0.000033083Δt Δt < 30225.99238 1 Δt ≥ 30225.99238 147 (7.19) Figure 7.13: Disagreement decay for the Accept header Figure 7.14: Agreement decay for the Accept header 7.3.3.5 Accept-Language Header Figure 7.15 shows the model learned for the disagreement decay of the AcceptLanguage header, which is described by Equation 7.20. As it can be seen in the figure, such disagreement decay grows very slowly (it is very unlikely for a browser to change its language requested to web servers).  = d (X5 , Δt) = e−4.09968+3.23283·10 1 −9 Δt2 Δt < 35610.94887 Δt ≥ 35610.94887 (7.20) Figure 7.16 shows the model learned for the agreement decay of the AcceptLanguage header, which is described by Equation 7.21. 148 Figure 7.15: Disagreement decay for the Accept-Language header Figure 7.16: Agreement decay for the Accept-Language header  d= (X5 , Δt) = 7.3.3.6 0.0000347399 + 0.0000330993Δt Δt < 30211.06972 1 Δt ≥ 30211.06972 (7.21) Accept-Charset Header Figure 7.17 shows the model learned for the disagreement decay of the AcceptCharset header, which is described by Equation 7.22. As happened with the previous header, such disagreement decay grows very slowly. 149 Figure 7.17: Disagreement decay for the Accept-Charset header Figure 7.18: Agreement decay for the Accept-Charset header  d= (X6 , Δt) = (0.0281337 + 4.35781 · 10−10 Δt2 )2 Δt < 47224.69002 1 Δt ≥ 47224.69002 (7.22) Figure 7.18 shows the model learned for the agreement decay of the AcceptCharset header, which is described by Equation 7.23.  
d= (X6 , Δt) = 0.0000339481 + 0.0000330917Δt Δt < 30218.03207 1 Δt ≥ 30218.03207 150 (7.23) 7.3.3.7 Accept-Encoding Header Figure 7.19 shows the model learned for the disagreement decay of the AcceptEncoding header, which is described by Equation 7.24. As happened with the Accept-Language and Accept-Charset, such disagreement decay grows very slowly.  = d (X7 , Δt) = e−4.67385+0.000115853Δt Δt < 40342.93458 1 Δt ≥ 40342.93458 (7.24) Figure 7.20 shows the model learned for the agreement decay of the AcceptEncoding header, which is described by Equation 7.25. Figure 7.19: Disagreement decay for the Accept-Encoding header Figure 7.20: Agreement decay for the Accept-Encoding header 151  d= (X7 , Δt) = 7.3.3.8 0.0000333693 + 0.0000330786Δt Δt < 30230.01671 1 Δt ≥ 30230.01671 (7.25) Cache-Control Header Figure 7.21 shows the model learned for the disagreement decay of the CacheControl header, which is described by Equation 7.26. As happened with the Accept-Language, Accept-Charset, and Accept-Encoding headers such disagreement decay grows very slowly.  d= (X8 , Δt) = e−4.41392+0.0222535 1 √ Δt Δt < 39341.62217 Δt ≥ 39341.62217 (7.26) Figure 7.22 shows the model learned for the agreement decay of the CacheControls header, which is described by Equation 7.27.  d= (X8 , Δt) = −0.0112165 + 0.00005847Δt − 8.0021 · 10−10 Δt2 Δt < 28104.69216 1 Δt ≥ 28104.69216 (7.27) Figure 7.21: Disagreement decay for the Cache-Control header 152 Figure 7.22: Agreement decay for the Cache-Control header 7.3.3.9 Plugins Figure 7.23 shows the model learned for the disagreement decay of the Plugins installed within the browser, which is described by Equation 7.28.  d= (X9 , Δt) = √ (0.0835439 + 0.00570466 Δt)2 Δt < 25808.56167 1 Δt ≥ 25808.56167 (7.28) Figure 7.24 shows the model learned for the agreement decay of the Plugins attribute, which is described by Equation 7.29.  d= (X9 , Δt) = e−10.1405+0.98999 ln Δt Δt < 28086.17546 1 Δt ≥ 28086.17546 153 (7.29) Figure 7.23: Disagreement decay for the Plugins attribute Figure 7.24: Agreement decay for the Plugins attribute 7.3.3.10 Fonts Figure 7.25 shows the model learned for the disagreement decay of the Fonts attribute. The regression function is not defined in the interval [0, 1102.506998) as the radicand expression produces negative numbers. For such interval the model shown in Figure 7.26 has been obtained by performing and additional regression specifically for the interval [0, 1102.506998). Joining both models, the disagreement decay the Fonts attribute is described by Equation 7.30. 154 Figure 7.25: Disagreement decay for the Fonts attribute (second interval) Figure 7.26: Disagreement decay for the Fonts attribute (first interval) = d (X10 , Δt) = ⎧ ⎪ (0.012879 + 0.0605308 ln Δt)2 ⎨ √ ⎪ ⎩ 0 < Δt < 1102.506998 −0.230069 + 0.00692895 Δt 1102.506998 ≤ Δt < 31515.49207 Δt ≥ 31515.49207 1 (7.30) Figure 7.27 shows the model learned for the agreement decay of the Fonts attribute, which is described by Equation 7.31.  d= (X10 , Δt) = 0.0000332742 + 0.0000330753Δt Δt < 30233.03570 1 Δt ≥ 30233.03570 155 (7.31) Figure 7.27: Agreement decay for the Fonts attribute 7.3.3.11 Video Figure 7.28 shows the model learned for the disagreement decay of the Video attribute, which is described by Equation 7.32.  = d (X11 , Δt) = √ (0.233452 + 0.00417302 Δt)2 Δt < 33742.54047 1 Δt ≥ 33742.54047 (7.32) Figure 7.29 shows the model learned for the agreement decay of the Video attribute, which is described by Equation 7.33.  
d= (X11 , Δt) = 0.0000673283 + 0.0000331751Δt Δt < 30141.05976 1 Δt ≥ 30141.05976 156 (7.33) Figure 7.28: Disagreement decay for the Video attribute Figure 7.29: Agreement decay for the Video attribute 7.3.3.12 Time zone Figure 7.30 shows the model learned for the disagreement decay of the Time zone attribute, which is described by Equation 7.34.  d= (X12 , Δt) = (0.12658 + 0.000024006Δt)2 Δt < 36383.40415 1 Δt ≥ 36383.40415 (7.34) Figure 7.31 shows the model learned for the agreement decay of the Time zone attribute, which is described by Equation 7.34. 157 Figure 7.30: Disagreement decay for the Time zone attribute Figure 7.31: Agreement decay for the Time zone attribute  d= (X12 , Δt) = 7.3.3.13 0.0000337212 + 0.0000337212Δt Δt < 30222.05737 1 Δt ≥ 30222.05737 (7.35) Session Storage Figure 7.32 shows the model learned for the disagreement decay of the Session storage attribute, which is described by Equation 7.36. As happened with the Accept-Language, Accept-Charset, Accept-Encoding, and Cache-Control attributes such disagreement decay grows very slowly. 158 Figure 7.32: Disagreement decay for the Session Storage attribute Figure 7.33: Agreement decay for the Session storage attribute  d= (X13 , Δt) = e−5.91823+0.0303167 1 √ Δt Δt < 38108.32197 Δt ≥ 38108.32197 (7.36) Figure 7.33 shows the model learned for the agreement decay of the Session storage attribute, which is described by Equation 7.37.  d= (X13 , Δt) = 0.0000331289 + 0.0000330721Δt Δt < 30235.96539 1 Δt ≥ 30235.96539 159 (7.37) 7.3.3.14 Local Storage Figure 7.34 shows the model learned for the disagreement decay of the Local storage attribute, which is described by Equation 7.38.  d= (X14 , Δt) = e−6.04222+0.0306873 1 √ Δt Δt < 38768.20650 Δt ≥ 38768.20650 (7.38) Figure 7.35 shows the model learned for the agreement decay of the Local storage attribute, which is described by Equation 7.39. Figure 7.34: Disagreement decay for the Local storage attribute Figure 7.35: Agreement decay for the Local Storage attribute 160  d= (X14 , Δt) = 7.3.3.15 0.0000331289 + 0.0000330721Δt Δt < 30235.96539 1 Δt ≥ 30235.96539 (7.39) Internet Explorer Persistence Figure 7.36 shows the model learned for the disagreement decay of the Internet Explorer persistence attribute, which is described by Equation 7.40.  d= (X15 , Δt) = e−6.29214+0.0331916 1 √ Δt Δt < 35936.88071 Δt ≥ 35936.88071 (7.40) Figure 7.37 shows the model learned for the agreement decay of the Internet Explorer persistence attribute, which is described by Equation 7.41.  d= (X15 , Δt) = 0.0000331737 + 0.0000330731Δt Δt < 30235.04982 1 Δt ≥ 30235.04982 (7.41) Figure 7.36: Disagreement decay for the Internet Explorer persistence attribute 161 Figure 7.37: Agreement decay for the Internet Explorer persistence attribute 7.4 Evaluation We have evaluated the four variants of the technique for uniquely identifying users based in the fingerprint of their devices described in this chapter, which are the following: 1. Assigning equal weight to each fingerprint attribute. 2. Assigning the entropy of the attribute as attribute weight. 3. Taking into account agreement and disagreement decays. 4. Combining attribute entropy with agreement and disagreement decay. As described in Section 7.3.2 we have used a corpus of activity records with users identified with the technique based on cookies as gold standard. The evaluation has been performed with different values of θ (i.e. threshold at which it is considered that two fingerprints correspond to the same browser). 
For each variant and threshold, we have measured algorithm performance, according to a set of evaluation metrics. For the variants that require to train a decay and/or entropy model (i.e. all with the exception of the one based on uniform weights), we have performed 2-fold cross-validation, dividing the dataset into two subsets. We have assigned randomly records to each subset, so that both subsets are equal in size. For each subset, we have learned decay and entropy values, and evaluated the algorithm 162 performance with the other subset, letting us to recommend the best algorithm variant, and to compare our results with previous work. This section is structured as follows: • Section 7.4.1 describes the metrics used for evaluating the technique. • Section 7.4.2 presents the evaluation results obtained for each variant and threshold, comparing such results and obtaining an optimum setting. 7.4.1 Evaluation Metrics The technique proposed for unique user identification can be evaluated as a clustering algorithm since its objective is to group fingerprint records corresponding to unique users. Most of the metrics used for evaluating this work interpret the clustering as a set of decisions, one for each of the N (N − 1)/2 pairs of elements (i.e. pairs of fingerprint records). In this context: • T P is the number of true positive decisions. A true positive decision assigns two fingerprints corresponding to the same user to the same cluster. • T N is the number of true negative decisions. A true negative decision assigns two fingerprints corresponding to distinct users to different clusters. • F P is the number of false positive decisions. A false positive decision assigns two fingerprints corresponding to distinct users to the same cluster. • F N is the number of false negative decisions. A false negative decision assigns two fingerprints corresponding to a same user to different clusters. Taking into account the T P , T N , F P , and F N indicators, the metrics used for evaluating the performance of the technique for unique user identification are described next. 163 7.4.1.1 Rand Index The Rand Index metric [Rand, 1971] measures the percentage of correct clustering decisions. Equation 7.42 shows its definition. RI = TP + TN TP + FP + TN + FN (7.42) The range of this metric is [0..1]. We consider satisfactory values for this metric those that are over 0.9. 7.4.1.2 Error Rate The Error Rate metric [Kohavi and Provost, 1998] measures the percentage of incorrect decisions. Equation 7.43 shows its definition. Error = FP + FN TP + FP + FN + TN (7.43) The range of this metric is [0..1]. We consider satisfactory values for this metric those that are bellow 0.1, as Error = 1 − RI. 7.4.1.3 Recall The Recall metric [Kowalski, 1997] (a.k.a. sensitivity or hit rate) is the true positive rate. Equation 7.44 shows its definition. Recall = TP TP + FN (7.44) The range of this metric is [0..1]. We consider satisfactory values for this metric those that are over 0.85. 7.4.1.4 Specificity The Specificity metric [Kohavi and Provost, 1998] is the true negative rate. Equation 7.45 shows its definition. Specif icity = 164 TN FP + TN (7.45) The range of this metric is [0..1]. We consider satisfactory values for this metric those that are over 0.9. 7.4.1.5 False Positive Rate Equation 7.46 defines the False Positive Rate metric [Kohavi and Provost, 1998] (a.k.a. fall-out). FP FP + TN FPR = (7.46) The range of this metric is [0..1]. We consider satisfcactory values for this metric those that are bellow 0.1. 
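Before defining the remaining metrics, the following minimal sketch illustrates how the pairwise decision counts introduced above could be obtained by comparing the clustering produced by the technique with the cookie-based gold standard. It assumes that each fingerprint record carries a predicted cluster identifier and the gold user identifier, and derives from the counts the metrics defined so far.

// Sketch: pairwise TP/TN/FP/FN counts over the N(N-1)/2 pairs of
// fingerprint records. Each record is assumed to carry:
//   predicted : the cluster assigned by the technique
//   gold      : the user identified by the cookie-based gold standard
function pairwiseCounts(records) {
  var counts = { tp: 0, tn: 0, fp: 0, fn: 0 };
  for (var i = 0; i < records.length; i++) {
    for (var j = i + 1; j < records.length; j++) {
      var sameCluster = records[i].predicted === records[j].predicted;
      var sameUser = records[i].gold === records[j].gold;
      if (sameCluster && sameUser) counts.tp++;
      else if (!sameCluster && !sameUser) counts.tn++;
      else if (sameCluster && !sameUser) counts.fp++;
      else counts.fn++;
    }
  }
  return counts;
}

// Metrics derived from the counts (Equations 7.42, 7.44 and 7.46).
function randIndex(c) { return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn); }
function recall(c) { return c.tp / (c.tp + c.fn); }
function falsePositiveRate(c) { return c.fp / (c.fp + c.tn); }

Since the number of pairs grows quadratically with the number of records, an actual evaluation would typically aggregate the counts per pair of clusters rather than enumerate every pair explicitly.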
7.4.1.6 False Negative Rate Equation 7.47 defines the False Negative Rate [Kohavi and Provost, 1998] metric. F NR = FN FN + TP (7.47) The range of this metric is [0..1]. We consider satisfactory values for this metric those that are bellow 0.15, as F N R = 1 − Recall. 7.4.1.7 Precision The Precision metric [Kowalski, 1997] is defined as the positive predictive value. Equation 7.48 shows its definition. P recision = TP TP + FP (7.48) The range of this metric is [0..1]. We consider satisfactory values for this metric those that are over 0.9. 7.4.1.8 F-measure The F-measure metric [Larsen and Aone, 1999] combines the precision and recall metrics offering an overall vision of how the technique behaves. It is defined as 165 the harmonic mean of precision and recall. Equation 7.49 shows its definition. F1 = 2 · P recision · Recall P recision + Recall (7.49) The range of this metric is [0..1]. We consider satisfactory values for this metric those that are over 0.87, taking into account the minimum Precision and Recall satisfactory values. 7.4.1.9 Purity This metric, defined by Zhao and Karypis [2001], represents clusters’ purity. To calculate it we assign the most frequent user in the cluster for each fingerprint cluster obtained. Then, the classification performance is measured as the number of fingerprint records assigned correctly to a cluster, divided by the total number of records. Let Ω = {ω1 , ω2 , ..., ωk } be the set of clusters obtained, C = c1 , c2 , ..., cj the number of users, and N the total number of fingerprint records, the Purity metrics is obtained as shown by Equation 7.50. P urity(Ω, C) = 1 max|ωk ∩ cj | N k j (7.50) The range of this metric is [0..1]. We consider satisfactory values for this metric those that are over 0.85. 7.4.2 Evaluation Results This section presents the evaluation results for the variants of the technique and compares them, obtaining and optimum combination of variant and threshold at which it is considered that two fingerprints correspond to the same user’s browser. 7.4.2.1 Variant Based on Uniform Weights This variant assigns the same weight for all the fingerprint attributes. Therefore, all of these attributes have the same importance for determining whether two fingerprints correspond to a same browser. 166 Measure Rand Index Error Rate Recall Specificity False Positive Rate False Negative Rate Precision F-measure Purity θ = 0.7 0.9978 0.0011 0.41 0.99836 0.00164 0.59 0.20 0.27 0.44 θ = 0.75 0.9994 0.0003 0.74 0.99965 0.00035 0.26 0.67 0.7 0.77 θ = 0.8 0.9998 0.0001 0.91 0.99985 0.00015 0.09 0.85 0.88 0.92 θ = 0.85 0.9998 0.0001 0.91 0.99986 0.00014 0.09 0.86 0.88 0.92 θ = 0.9 0.9996 0.0002 0.63 0.99995 0.00005 0.37 0.92 0.75 0.96 θ = 0.95 0.9994 0.0003 0.39 0.99996 0.00004 0.61 0.91 0.54 0.96 Table 7.9: Evaluation results for the variant based on uniform weights Table 7.9 shows the evaluation results corresponding to different values of θ, from where the following insights can be obtained: • The Rand Index and Error Rate metrics are good for all the values assigned to θ. • The Specificity and False Positive Rate metrics are good for all the values assigned to θ. • The Recall and False Negative Rate metrics are good for θ = 0.8 y θ = 0.85. • The Precision metric is good for θ = 0.9 and θ = 0.95, although for these values, the recall is not admissible. • The F-measure metric is acceptable for θ = 0.8 and θ = 0.85. • The Purity metric is good for θ > 0.8. 
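As an illustration of how the per-threshold figures reported in Table 7.9 (and in the tables of the following sections) could be produced, the sketch below computes the Purity of Equation 7.50 and evaluates one variant over several thresholds, reusing the pairwise_counts and metrics helpers sketched earlier; cluster_fn is a hypothetical stand-in for the clustering variant under evaluation.

from collections import Counter, defaultdict

def purity(records):
    # records: (gold_user_id, predicted_cluster_id) pairs; implements Equation 7.50
    clusters = defaultdict(list)
    for user, cluster in records:
        clusters[cluster].append(user)
    majority = sum(Counter(users).most_common(1)[0][1] for users in clusters.values())
    return majority / len(records)

def evaluate_variant(gold_users, cluster_fn, fingerprints, thresholds):
    # cluster_fn(fingerprints, theta) -> one predicted cluster id per record (hypothetical)
    results = {}
    for theta in thresholds:
        records = list(zip(gold_users, cluster_fn(fingerprints, theta)))
        tp, tn, fp, fn = pairwise_counts(records)
        row = metrics(tp, tn, fp, fn)
        row["purity"] = purity(records)
        results[theta] = row
    return results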
The values that optimise the corresponding metrics among all the variants are marked in bold in Table 7.9. 7.4.2.2 Variant Based on Attribute Entropy This variant assigns to the weight of each fingerprint attribute its corresponding entropy. Therefore, each attribute has an importance that is proportional to the quantity of information that it provides for distinguishing a fingerprint record 167 from other, or for clustering fingerprints that correspond to a same user. As an example, the plugins installed in the browser will have more weight than the time zone. Table 7.10 shows the evaluation results corresponding to different values of θ, from where the following insights can be obtained: • The Rand Index and Error Rates metrics are good for all the values assigned to θ. • The Specificity and False Positive Rate metrics are good for all the values assigned to θ. • The Recall and False Negative Rate metrics are not as good as with other variants. • The Precision metric is good for all values of θ. • The F-measure metric is not as good as with other variants. • The Purity metric is good for all the values of θ. The values that optimise the corresponding metrics among all the variants are marked in bold in Table 7.10. Measure Rand Index Error Rate Recall Specificity False Positive Rate False Negative Rate Precision F-measure Purity θ = 0.7 0.9996 0.0002 0.64 0.99995 0.00005 0.36 0.92 0.76 0.95 θ = 0.75 0.9996 0.0002 0.64 0.99995 0.00005 0.36 0.92 0.75 0.95 θ = 0.8 0.9996 0.0002 0.64 0.99995 0.00005 0.36 0.92 0.75 0.96 θ = 0.85 0.9994 0.0003 0.44 0.99996 0.00004 0.56 0.91 0.60 0.96 θ = 0.9 0.9994 0.0003 0.41 0.99996 0.00004 0.59 0.91 0.56 0.96 θ = 0.95 0.9994 0.0003 0.40 0.99996 0.00004 0.60 0.91 0.56 0.97 Table 7.10: Evaluation results for the variant based on attribute entropy 168 7.4.2.3 Variant Based on Time Decay This variant assigns to the weight of each fingerprint attribute its corresponding agreement and disagreement decays. Therefore each attribute has an importance proportional to the probability of change or sharing between fingerprint records. Table 7.11 shows the evaluation results corresponding to different values of θ, from where the following insights can be obtained: • The Rand Index and Error Rates metrics are good for all the values assigned to θ. • The Specificity and False Positive Rate metrics are good for all the values assigned to θ. • The Recall and False Negative Rate metrics are not as good as with other variants. • The Precision metric is good for θ = 0.95. • The F-measure metric is not as good as with other variants. • The Purity metric is good for θ = 0.9 y θ = 0.95. The values that optimise the corresponding metrics among all the variants are marked in bold in Table 7.11. Measure Rand Index Error Rate Recall Specificity False Positive Rate False Negative Rate Precision F-measure Purity θ = 0.7 0.9977 0.0012 0.31 0.99832 0.00168 0.69 0.15 0.2 0.32 θ = 0.75 0.9986 0.0007 0.36 0.99921 0.00079 0.64 0.31 0.22 0.45 θ = 0.8 0.9991 0.0005 0.42 0.99964 0.00036 0.58 0.53 0.47 0.61 θ = 0.85 0.9994 0.0003 0.61 0.99981 0.00019 0.39 0.76 0.68 0.79 θ = 0.9 0.9997 0.0002 0.74 0.99991 0.00009 0.26 0.89 0.81 0.89 θ = 0.95 0.9995 0.0002 0.53 0.99996 0.00004 0.47 0.92 0.67 0.95 Table 7.11: Evaluation results for the variant based on time decay 169 7.4.2.4 Variant Based on Attribute Entropy and Time Decay This variant assigns to the weight of each fingerprint attribute a combination of its corresponding entropy, and agreement and disagreement decays. 
Therefore each attribute has an importance proportional to the quantity of information that adds for distinguishing a fingerprint from another, as well as to the probability of change or sharing between fingerprint records. Table 7.12 shows the evaluation results corresponding to different values of θ, from where the following insights can be obtained: • The Rand Index and Error Rates metrics are good for all the values assigned to θ. • The Specificity and False Positive Rate metrics are good for all the values assigned to θ. • The Recall and False Negative Rate metrics are good for θ = 0.7, θ = 0.75 y θ = 0.8. • The Precision metric is good for all the values of θ. • The F-measure metric is good for θ = 0.7, θ = 0.75, and θ = 0.8. • The Purity metric is good for all values of θ. The values that optimise the corresponding metrics among all the variants are marked in bold in Table 7.12. 170 Measure Rand Index Error Rate Recall Specificity False Positive Rate False Negative Rate Precision F-measure Purity θ = 0.7 0.9998 0.0001 0.89 0.9999 0.00010 0.11 0.9 0.9 0.88 θ = 0.75 0.9998 0.0001 0.88 0.99992 0.00008 0.12 0.91 0.9 0.91 θ = 0.8 0.9998 0.0001 0.87 0.99993 0.00007 0.13 0.93 0.9 0.94 θ = 0.85 0.9996 0.0002 0.62 0.99995 0.00005 0.38 0.92 0.74 0.95 θ = 0.9 0.9994 0.0002 0.46 0.99996 0.00004 0.54 0.91 0.61 0.95 θ = 0.95 0.9994 0.0002 0.45 0.99996 0.00004 0.55 0.91 0.6 0.96 Table 7.12: Evaluation results for the variant based on attribute entropy and time decay 7.4.2.5 Comparison of the Variants Figure 7.38 shows a ROC (Receiver Operating Characteristic) graph [Egan, 1975] with plots representing the algorithm variants with different thresholds. A ROC space is defined by False Positive Rate and Recall (or True Positive Rate) metrics as x and y axes respectively, which depicts relative trade-offs between true positive (benefits) and false positive (costs). The best possible prediction method would yield a point in the upper left corner or coordinate (0,1) of the ROC space, representing no false negatives and no false positives (perfect classification). Therefore, the best-performing variants are in the upper left corner of the figure. Such variants are the one that uses the same weight for all fingerprint attributes (for θ = 0.8 and θ = 0.85), and the one that takes into account entropy and decay (for θ = 0.7, θ = 0.75 and θ = 0.8). Taking into account entropy or decay by separate (second and third variant) do not produce the better results than the uniform weights variant. Table 7.13 compares the variants that provide better results (optimous results in bold), showing the following insights: • The Rand Index and Error Rate metrics are the same for all the variants and thresholds. • The Recall and False Negative Rate metrics are slightly better for the variant that assigns the same weight for all the attributes (first variant), al- 171 Figure 7.38: Performance of the variants evaluated for the technique for unique user identification based on evolving device fingerprint detection though they are acceptable for the variant that takes into account decay and entropy (fourth variant). • On the other hand, Specificity and False Positive Rate are slightly better for the fourth variant, although they are acceptable for the first variant. • Precision is better for the fourth variant (over 0.9). • F-measure is higher for the fourth variant (over 0.9). • In addition, the algorithm achieves better purity values for the fourth variant with θ = 0.8 (P urity = 0.94). 
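One simple way to summarise the trade-off depicted in Figure 7.38 is to measure how far each configuration lies from the ideal (0,1) corner of the ROC space. The helper below is our own illustration (not part of the evaluation framework) and uses the same pairwise counts as before.

import math

def roc_point(tp, tn, fp, fn):
    # (false positive rate, recall) coordinates of a configuration in ROC space
    return fp / (fp + tn), tp / (tp + fn)

def distance_to_ideal(tp, tn, fp, fn):
    # Euclidean distance to the perfect-classification corner (0, 1); lower is better
    fpr, tpr = roc_point(tp, tn, fp, fn)
    return math.hypot(fpr, 1.0 - tpr)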
In summary, the variant that behaves best is the one that takes into account entropy and decay, since it provides the maximum values of Rand Index, F-measure and Purity.

Measure                   Uniform weights        Decay and entropy
                          θ = 0.8    θ = 0.85    θ = 0.7    θ = 0.75    θ = 0.8
Rand index                0.9998     0.9998      0.9998     0.9998      0.9998
Error rate                0.0001     0.0001      0.0001     0.0001      0.0001
Recall (or sensitivity)   0.91       0.91        0.89       0.88        0.87
Specificity               0.99985    0.99986     0.9999     0.99992     0.99993
False positive rate       0.00015    0.00014     0.00010    0.00008     0.00007
False negative rate       0.09       0.09        0.11       0.12        0.13
Precision                 0.85       0.86        0.9        0.91        0.93
F-measure                 0.85       0.86        0.9        0.9         0.9
Purity                    0.92       0.92        0.88       0.91        0.94

Table 7.13: Comparison of the best-performing variants

7.5 Hypothesis Validation

In comparison with the algorithm proposed by Eckersley [2010], our algorithm performs better, since the accuracy of the former is 0.991, while the accuracy of the latter is 0.9998. The false positive rate of Eckersley [2010] is 0.0086, while ours is almost zero (0.00007). Moreover, the algorithm described by Eckersley [2010] only classifies 65% of the fingerprints (those where the browser has the Java Virtual Machine or Flash installed). By contrast, our algorithm produces a classification in all cases, regardless of whether Flash or the Java Virtual Machine is installed.

The evaluation performed on our approach for unique user identification validates Hypothesis 2 of this work, since our technique allows grouping and identifying the activity generated by website visitors through the digital fingerprint of their devices, even when such fingerprint varies over time, with higher performance along different metrics than the previously existing approach.

Chapter 8
TECHNIQUES FOR SEGMENTATION OF CONSUMERS FROM SOCIAL MEDIA CONTENT

This chapter describes another main contribution of this thesis to the State of the Art, which consists in a collection of techniques for extracting socio-demographic and psychographic profiles from social media users applied to the marketing domain, through the analysis of the opinions they express about brands, as well as from the profiles published by them in social networks. Specifically, these techniques are the following:

• A technique for classifying consumer opinions produced in social media according to the Consumer Decision Journey stages, which is described in Section 8.2.

• A technique for classifying consumer opinions produced in social media according to the Marketing Mix framework, which is described in Section 8.3.

• A technique for analysing consumer opinions written in Spanish according to the emotions expressed in such opinions, which is described in Section 8.4.
The techniques described in this chapter implement generic activities and tasks defined by the CRISP-DM methodology [Shearer, 2000], which has been described in Section 4.3.2. 8.1 Common Elements Used by the Techniques The content-analysis techniques described in this thesis have been trained and evaluated with corpora extracted from social media. Section 8.1.1 describes the data collection task used for obtaining such corpora, while Section 8.1.2 describes the technique used for preparing the corpora used by the content-analysis contributions of this thesis. In addition, two techniques presented in this thesis (i.e. the technique for detecting Consumer Decision Journey stages and the technique for identifying emotions) make use of rule-based models, which rely on a variety of linguistic information such as lexical items or morphosyntactic features (e.g. future tense). Such models have been developed following the modelling technique described in Section 8.1.3. 176 ad Collect Initial Data Task «parallel» Search Extract links language Opinion Clipping link text clips paragraphs brand terms Figure 8.1: Initial Data Collection task executed by the content-analysis techniques 8.1.1 Collect Initial Data Task This task implements the Collect Initial Data generic task of the CRISP-DM methodology [Shearer, 2000] (see Section 4.3.2.2). It is oriented to find and retrieve from different social media textual contents that mention brands. The workflow followed by this task is shown in Figure 8.1 and consists in the ordered execution of the steps described next. Search. This step consists in defining a pool of brands with a list of lexical variants for each one (e.g. “Coca Cola” and “Coke” for the brand Coca Cola and using social media search services for looking for texts written in a set of objective languages that mention any of those brands, retrieving the links highlighted by the search results. In our work we used the search services provided by Google78 , Facebook79 , and Twitter80 . Extract. This step consists in retrieving and extracting the textual content referred by the links of the search results. Texts from structured data sources (i.e. from Twitter and Facebook) are directly retrieved from the values of the message attribute included in 78 https://developers.google.com/custom-search https://developers.facebook.com/docs/graph-api 80 https://dev.twitter.com/docs/api/1.1/get/search/tweets 79 177 the structured data object obtained by querying the corresponding REST [Fielding, 2000] API. Texts from unstructured data sources (i.e. web pages) are obtained by performing a scraping technique oriented to remove HTML mark-up. Opinion Clipping. Once the texts from each specific social media format have been collected, this step extracts the paragraphs (i.e. clips) that mention the selected brands (i.e. that contain at least one term of the list of terms used by the Search task). 8.1.2 Data Preparation Activity This task implements the Data Preparation generic activity of the CRISP-DM methodology (see Section 4.3.2.3). Once the content is retrieved, the goal of this activity is to filter the texts that are not relevant, either because they do not mention the brand, are written in a different language than the target language, or do not contain user-generated content. In addition, NLP (Natural Language Processing) tools were used to obtain the linguistic information upon which the content-analysis techniques were based. 
The texts were processed and annotated with linguistic information such as partof-speech, verb tense, and person. For these NLP tools to work properly, it was also crucial to normalise the texts that contain many typos, abbreviations, emoticons, etc. For enhancing the performance of the content analysis techniques described in this thesis, the data preparation activity executes a morphological normalisation of user-generated content. Such technique makes use of several gazetteers extracted from different open data sources collectively developed, including a SMS lexicon and Wikipedia. Wikipedia has been used in the past for different NLP activities, such as text categorisation [Gabrilovich and Markovitch, 2006], topic identification [Coursey et al., 2009], measuring the semantic similarity between texts [Gabrilovich and Markovitch, 2007], and word sense disambiguation [Mihalcea, 2007], among others. This activity consists in the ordered execution of the tasks shown in Figure 8.2, which are described next. 178 ad Data Preparation Activity Select Data Clean Data language brands Construct Data paragraphs cleansed data selection criteria paragraphs normalised posts Figure 8.2: Data Preparation Activity implemented by the content-analysis techniques 8.1.2.1 Select Data Task As described in Section 8.1.1 the Collect Initial Data task looks for contents written in a target language that refer to a commercial brand. For doing so, it uses the content retrieval APIs provided by social media. Such APIs may output false positives of the following kinds: 1. Posts that syntactically contain a brand term that do not refer to the brand itself. This is mainly due to the use of ambiguous terms (e.g. “Orange” may refer to a telecommunications company, a fruit or a colour). 2. The social network’s API do not have language detection capabilities, or retrieves posts that have been tagged with a given language but are not actually written in such language. For dealing with these situations, this task establishes the criteria for selecting the textual contents to be used from the collected raw data. Such contents must satisfy the following criteria: 1. The text of each post must contain a mention to a commercial brand. For automatically selecting the correct senses, two lists are added to the data selection criteria: • A list of mandatory terms that includes terms related to senses that refer to the brand (e.g. a text that contains “phone” or “mobile” may refer to the telecommunication company Orange). 179 • A list of forbidden terms that includes terms related to senses in which we are not interested (e.g. a text that contains “fruit” or “dessert” is more likely to refer to a sense of the word Orange different than the telecommunication company). 2. The text must be written in the target language for which the model will be learned. The task described next deals with removing contents from the dataset that do not satisfy the previous criteria. 8.1.2.2 Clean Data Task This task consists in removing the contents that are not relevant for the goal of the activity to be performed after preparing data. The workflow followed by this task is shown in Figure 8.3 and consists in the ordered execution of the steps described next. Filter Irrelevant Content. This step consists in automatically filtering the texts that syntactically contain one of the brand terms used for looking up the opinions, but do not refer to the correct sense (i.e. the brand). 
For ad Clean Data Task data selection criteria «parallel» mandatory terms forbidden terms language Filter Irrelevant Content Filter Language paragraph paragraph paragraphs [paragraph is not relevant] Filter SPAM paragraph [other language] Manual Revision paragraph [paragraph is SPAM] filtered paragraphs paragraphs cleansed Figure 8.3: Clean data task executed by the content-analysis techniques 180 doing so, this task takes out the texts that contain at least one forbidden term or that do not contain at least one mandatory term. Filter Language. This step consists in automatically removing the texts that are not written in the language for which the texts are being extracted. To do so, we have implemented a language detection component that combines multiple language classifiers and returns the language which has been detected the most by such classifiers. The language classifiers used are the following: • The Freeling’s [Padr´o and Stanilovsky, 2012] language identification module. • The Java Text Categorising Library81 that implements the text categorisation algorithm described by Cavnar and Trenkle [1994]. • The LingPipe82 toolkit for computational linguistics. • The language identification components provided by the Apache Tika83 framework. • The JLangDetect84 library. Filter SPAM. Since the text extraction technique applied in the Data Collection Task for unstructured formats may return pieces of text included in advertisements or navigation options of the web page, this step discards those texts in which brands are not part of the main content of the document, following Ntoulas et al. [2006] guidelines. After studying a representative set of 1,000 texts extracted from web pages, we decided that a text (with at least an occurrence of a brand) is invalid (i.e. it does not belong to the main content) unless it includes at least 30% of words belonging to the following list of grammatical categories: adpositions, determiners, conjunctions and pronouns. 81 http://textcat.sourceforge.net http://alias-i.com/lingpipe 83 http://tika.apache.org 84 http://github.com/melix/jlangdetect 82 181 To get the grammatical category of each word of the texts we made use of a part-of-speech tagger (see Section 2.6.1). Specifically, we used Freeling. Manual Revision. This step consists in manually reviewing the texts obtained after applying the automatic filtering heuristics described above, discarding irrelevant or useless contents, such as texts written in other languages that are not detected by the Filter Language step, or texts referring to other senses different than the brand that are not detected by the Filter Not Relevant Content step. The final corpus obtained after performing this step consists of the remaining texts, with annotations of the source from which they have been collected, the brand mentioned in the texts, and the domain to which they belong. 8.1.2.3 Construct Data Task The content analysis techniques presented in this thesis rely on linguistic patterns. In order to match these patterns with texts, these texts have to be processed and annotated with linguistic information such as part-of-speech, verb tense, and person. Linguistic processing is carried out by an automatic tagger. However, such tagger cannot properly work with user-generated texts as the ones our techniques analyse. This is because social media user-generated texts contain a large number of misspellings, abbreviations and jargon words. 
Badly written texts imply a great amount of errors in the part-of-speech annotation process and, consequently, without a normalisation phase the developed classifiers do not work correctly. For dealing with this issue we have implemented the workflow shown in Figure 8.4. The phases involved in the data preparation task are described next. Sanitise. This phase transforms the text received by removing non-printable characters (i.e. control and format characters like the null character) and by converting different variations of the space character (e.g. non-breaking space, tab) into the standard whitespace symbol. Tokenise. This phase receives the text to be normalised and breaks it into words, Twitter metalanguage elements (e.g. hash-tags, user IDs), emoti- 182 ad Construct Data Task paragraphs cleansed «parallel» «parallel» standard language dictionary Sanitise sanitised post tokens Tokenise token Normalise Twitter Metalanguage Element Twitter metalanguage element normalised form Classify Token correct OOV words word in standard vocabulary correct OOV OR word variation Classify OOV Word SMS dictionary OOV word spell checker dictionary variation OR unknown OR correct normalised forms Concatenate Normalised Forms normalised post Check & Correct Spell OOV word normalised posts Figure 8.4: Construct data task executed by the content-analysis techniques cons, URLs, etc. The output (i.e. the list of tokens) is sent to the Classify Tokens phase. In our experiments, we used Freeling for social media content tokenisation. Its specific tokenisation rules and its user map module were adapted for dealing with smileys and particular elements typically used in Twitter, such as hash-tags, RTs, and user IDs. Classify Tokens. The input of this phase is the list of tokens generated by the tokeniser. It classifies each of them into one of the following categories: • Twitter metalanguage elements (i.e. hash-tags, user IDs, RTs and URLs). Such elements are detected by matching regular expressions against the token (e.g. if a token starts by the symbol “#”, then it is a hash-tag). Each token classified in this category is sent to the Normalise Twitter Metalanguage Element phase. • Words contained in a standard language dictionary, excluding proper 183 nouns. Each token classified in this category is sent to the Concatenate Normalised Forms phase. • Out-Of-Vocabulary (OOV) words. These are words that neither are found in a standard dictionary nor are Twitter metalanguage elements. Each token classified in this category is sent to the Classify OOV Word phase. We use the part-of-speech tagging module of Freeling within this phase. As we deactivate Freeling’s probability assignment and unknown word guesser module, all the words that are not contained in Freeling’s POS-tagging dictionaries are not marked with a tag and are considered as OOV words. Our standard vocabularies are, thus, the Freeling dictionaries themselves for English and Spanish. Additionally, for Spanish we have extended the standard vocabulary with a list of correct forms generated from the lemmas found in the Real Academia Espa˜ nola Dictionary (DRAE) by Gamallo et al. [2013]. Classify OOV Word. This phase receives every token previously classified as out-of-vocabulary by the previous phase and detects if it is correct, wrong, or unknown. If the token is wrong, it returns the correct form of the token. The task executes the following steps: 1. 
Firstly, the token is looked up in a secondary dictionary for those words that are not in a standard dictionary but that are known to correspond to correct forms (mostly proper nouns). The search disregards both case and accents. We have populated this secondary dictionary by making use of the list of article titles from Wikipedia85 . To speedup the process of querying the Wikipedia article titles (31,528,653 for English and 4,391,392 for Spanish), we uploaded them to a HBASE store86 . In order to increase the coverage of this dictionary, we incorporated into it two lists of first names obtained from the United States Census Bureau87 and from the Spanish National Institute of Statis85 http://en.wikipedia.org/wiki/Wikipedia:Database_download http://hbase.apache.org 87 http://www.census.gov 86 184 tics88 . The list of first names for the English language contains 1,218 male names and 4,273 female names, while the list for the Spanish language contains 18,679 male names and 19,817 female names. (a) If an exact match of the token is found in the dictionary (e.g. both forms are capitalised), then the token is classified as Correct and sent to the Concatenate Normalised Forms phase with no variation. (b) If the token is found with variations of case or accentuation, then the token is classified as Variation and its correct form is sent to Concatenate Normalised Forms phase. (c) If the token is not found in the dictionary, then the process continues in step 2. 2. The token is looked up in a SMS dictionary that contains tuples with the SMS term and its corresponding correct form. The search is caseinsensitive, and does not consider accent marks. We have populated such a dictionary with 898 common-used SMS terms for English extracted from different web sources. For Spanish, we have reused the SMS dictionary of the Spanish Association of Internet Users89 , which contains 53,281 entries. (a) If the token is found in the SMS dictionary, then it is classified as Variation and its correct form is retrieved and sent to the Concatenate Normalised Forms phase. (b) If the token is not found in the dictionary, then it is sent to the Check and Correct Spell phase. Check and Correct Spell. This phase checks the spelling of the token received and returns its correct form when possible. To do so, it executes the following steps: 1. Firstly, the token is matched against regular expressions to find whether it contains characters (or sequences of characters) repeated more than 88 89 http://www.ine.es/inebmenu/indice.htm http://aui.es 185 twice (e.g. “loooooollll” and “hahaha”). (a) If the token contains repeated characters (or sequences of characters), then the repeated ones are removed (e.g. “lol” and “ha”), and the resulting form is sent back to the Classify OOV word phase, since the new form may be included into the correct words set. (b) If the token does not contain repeated characters (or sequences of characters), then the process continues in step 2. 2. The token is sent to an existing spell checking and correction implementation. We make use of Jazzy90 , an open-source Java library. For the creation of the spell checker dictionaries used by Jazzy, we made use of the different varieties of English and Spanish dictionaries91 . The resulting dictionaries contain 237,667 terms for English and 683,462 terms for Spanish. (a) If the spell checking is correct, then the token is classified as Correct and sent to the Concatenate Normalised Forms phase without a variation. 
(b) If the spell checking is not correct, then the token is classified as Variation and the first correct form returned by the spelling corrector is sent to Concatenate Normalised Forms phase. (c) If the spell checker is not able to propose a correct form, the token is classified as Unknown and is sent to the Concatenate Normalised Forms phase without a variation. Normalise Twitter Metalanguage Element. This phase performs a syntactic normalisation of Twitter meta-language elements. Specifically, it executes the rules enumerated next. 1. Remove the sequence of characters “RT” followed by a mention to a Twitter user (marked by the symbol “@”) and, optionally, by a colon punctuation mark; 90 91 http://jazzy.sourceforge.net http://sourceforge.net/projects/jazzydicts 186 2. Remove user IDs that are not preceeded by a coordinating or subordinating conjunction, a preposition, or a verb; 3. Remove the word “via” followed by a user mentioned at the end of the tweet; 4. Remove all the hash-tags found at the end of the tweet; 5. Remove all the “#” symbols from the hash-tags that are maintained; 6. Remove all the hyper-links contained within the tweet; 7. Remove ellipsis points that are at the end of the tweet, followed by a hyper-link; 8. Replace underscores with blank spaces; and 9. Divide camel-cased words into multiple words (e.g. “DoNotLike” is converted to “Do Not Like”). As an example, after applying metalanguage normalisation, the tweet RT @AshantiOmkar: Fun moments with @ShwetaMohan at the O2! She was wearing a #DVY #DarshanaVijayYesudas outfit! http://t.co/... is converted into the text Fun moments with Shweta Mohan at the O2! She was wearing a DVY Darshana Vijay Yesudas outfit! which is easier for being processed by a part-of-speech tagger. Concatenate Normalised Forms. This phase receives the normalised form of each token and amends the post. 8.1.3 Rule-based Modelling Technique The techniques for detecting Consumer Decision Journey stages and for identifying emotions in user-generated content are based on the recognition of patterns 187 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ::= ( | < classification rule >)∗ ::= . < classification rule > ::= ”−>” ::= ( | | | | | | )+ ::= ’”’ ’”’ ::= ::= [ ] ::= # ::= ENTITY ::= ”∗” ::= / / ::= ::= ::= ”+” | − | ”∗” ::= Listing 8.1: BNF grammar of the linguistic rules as sequences of particular words. These patterns are part of what we called “linguistic rules”; a description of the pattern as particular conditions that have to be met in order to consider the text and example of a particular category. The general structure of the linguistic rules is shown next. The antecedent of the rule reflects the pattern/template of an expression in natural language and the consequent defines an action to be performed, which consists in modifying a numerical value associated to a given category. Listing 8.1 shows the BNF (Backus Naur Form) grammar [Backus et al., 1963] according to which the rules are expressed. Rules can be either defined for performing classification actions (e.g. incrementing the value for a given category) or as chunk actions (i.e. for dividing the text into fragments). The first component of classification and chunk rules is a linguistic pattern (see pattern in Listing 8.1). Such pattern describes the relevant features of a expression 188 in natural language at the morphosyntactic level. Each word of the pattern can be represented by itself —e.g. “girls”— or as its lemma —e.g. 
girl — alone, or with (some components of) its part-of-speech tag —e.g. girl#N. Sometimes, only the part-of-speech tag is important —e.g. [N] —, and some others, only the maximum number —e.g. /1/ — or the existence —e.g. * — of words matters. This allows for quite a flexible specification (see sections 8.2.2.2 and 8.4.2.3 for examples). Regarding classification rules (see classification rule in Listing 8.1), such rules perform an arithmetic operation over a value corresponding to a given category whenever the linguistic pattern is matched against the text. The operations available are addition, subtraction and multiplication, denoted by the operators “+”, “−” and “∗”, respectively. The addition and subtraction operations are used to designate the polarity of a classification, as in the case of sentiment analysis (e.g. the adjective “smart” can be modelled with “+1”, while “fool ” can be modelled with “−1” ). The multiplication operation is useful to invert the polarity of a unit (e.g. the negation particle “no” can be modelled with “∗ − 1”), and to increase or decrease its value (e.g. the adverb “very” can be modelled with “∗2”, while “little” can be modelled with “∗0.5”). The rule engine executes the following steps for classifying a text: 1. Firstly, the lemma and the part-of-speech tag of every token (i.e. lexical unit) included in the text are obtained, outputting a sequence of tuples made up of the token, its lemma and its morphosyntactic category. Therefore, this step performs the lemmatisation and part-of-speech tagging of the tokens received as described in Section 2.6.1. In our experiments, the morphosyntactic annotations were added by the use of the Freeling part-of-speech tagger. Therefore, the part-of-speech tags used for English are those defined by Santorini [1991] and for Spanish those standardised by Leech and Wilson [1996]. 2. In this second step, a sentence splitter divides the texts. Additionally, the set of chunk rules is applied in order to divide the text into the different sequence units to be analysed (e.g. the conjunction “and” can determine two units: the one on its left side and the one on its right side). 189 In our experiments we reused Freeling’s sentence splitter. 3. The third step consists in identifying the linguistic patterns that match the entire text or a part of the text obtained in the previous step. For each sequence unit, it identifies the antecedents of the rules that match all or part of the unit. If there are several antecedents that match the same part of the unit that overlap: (a) If their corresponding consequents affect the same category, it selects the first rule among the most restrictive ones (i.e. among the ones that match the longest text, the one found in first place). Once the matching expressions have been detected, a tuple made up of the category (e.g. “PURCHASE”), the operation (e.g. “+”), and the value (e.g. “1”) of the consequent is appended to a list of operations. (b) If their corresponding consequents affect different categories, a tuple for each category is appended. Otherwise, i.e. if there is not a matching expression, nothing is appended to the list. As misspellings are likely to be found in user-generated content, this matching step is not case-sensitive and does not take into account accent marks. Therefore, all the words and lemmas contained either in the rules and in the texts are transformed to lowercase and accent marks are stripped from them. 4. 
When all the units of the text have been processed, the list of operations is computed. First the sum operations are carried out (i.e. the positive and the negative values are added up) and then the product operations are applied to the result of that addition (e.g. “* -1” for inverting the value due to a negation, or “* 2” for doubling the value due to an intensifying adverb). 5. As a result of that computation, a numeric value is obtained for each category contained in the consequents of the rules. 190 In the case of chunk rules (see chunk rule in Listing 8.1), such rules split a text into the fragments delimited by the linguistic pattern. For example the following rule “[CC] .” implies than whenever a coordinating conjunction is found within a text, such text will be divided into two fragments, the one before the coordinating conjunction, and the one after the coordinating conjunction. Classification rules will apply to each fragment separately. 8.2 Technique for Detecting Consumer Decision Journey Stages In order to achieve one the objectives of this thesis, i.e. to develop a technique for automatically classifying short user-generated texts into stages of the Consumer Decision Journey, we have carried out the activities described next. 1. The Data Understanding activity collects a corpus of texts generated by consumers, creates a gold standard from the gathered corpus and validates that the gold standard is valid for learning purposes. The instantiation of this activity for gathering the corpus for the detection of Consumer Decision Journey stages is explained in Section 8.2.1. 2. The Data Preparation activity covers all the tasks required to construct the dataset for learning and evaluating the technique, including data cleansing and content normalisation. This activity is common to other contentanalysis techniques, and has been described in Section 8.1.2. 3. The Modelling activity engineers a rule-based model for classifying usergenerated content into Consumer Decision Journey stages. This activity is explained in Section 8.2.2. 8.2.1 Data Understanding Activity This activity consists in the ordered execution of the following tasks: 191 1. The Collect Initial Data task consists in gathering the corpus and creating the gold standard required for learning purposes. This task is described in Section 8.2.1.1. 2. The Describe Data task consists in performing a description of the format and volume of the gold standard. This task is described in Section 8.2.1.2. 3. The Explore Data task consists in performing a deeper statistical analysis of the gold standard from several viewpoints to ensure that it is valid for modelling purposes. This task is described in Section 8.2.1.3. 4. The Verify Data Quality task consists in examining the quality of the gold standard by attending to the analyses performed in the previous tasks. This task is described in Section 8.2.1.4. 8.2.1.1 Collect Initial Data Task This task applies the approach described in Section 8.1.1 for retrieving textual contents mentioning commercial brands from different social media, and constructs the gold standard required for model creation and evaluation. In order to identify the linguistic patterns utilised to express the different stages of the Consumer Decision Journey, and also to carry out the evaluations, this task builds a gold standard by manually annotating a corpus of usergenerated content according to the Consumer Decision Journey stages that can be derived from such content. 
To do so, human annotators are asked to tag each text with just one label following the description provided below. Awareness. All the texts that refer to advertisement campaigns or opinions about advertisements are generally expressed in first person. These texts should contain information about the user’s experience with respect to the advertisement or the knowledge of the brand. For example92 : I love Ford’s ad 92 The examples included in this thesis correspond to individual comments about brands. Therefore, opinions presented in this document do not necessarily correspond to the view of the author, neither represent the majority judgements of consumers. 192 Evaluation. All the texts that state interest and/or show an active research towards the brand or product. For example: My daughter and I are looking for a Fiat-like van in good condition The annotator should also annotate as evaluation all the texts that express a preference (positive or negative) and that we cannot infer user experience from them. For instance: Well, I’d rather fly with Emirates than with Ryanair Purchase. All the texts that explicitly express the decision to buy are generally conveyed in first person and in future tenses. Texts that refer to the exact moment of the purchase also belong to this stage. For example: The car is in the authorized dealer, I’m buying it tomorrow Post-purchase. All the texts that explicitly refer to a past purchase and/or an actual user experience, are generally expressed in first person, in present as well as in past tenses. Texts that convey the possession or the use of some product are also annotated as “post-purchase”, although there is no opinion about it. Some examples: We went on the Mazda I bought a 2002 Jaguar two days ago I’ve been using a pair of Nike for the past two years, and I’m delighted However, not all the texts in the corpus clearly pertained to one of the Consumer Decision Journey categories. It was obvious that a great amount of the texts did not imply user experience, or the stages appeared mixed. Therefore, we established two other categories under which the human annotators could tag the texts: ambiguous and no corresponding. The specific instructions to annotate these kinds of texts are the following: 193 Ambiguous. All the texts where the author recommends or criticises the product or brand but they do not imply active evaluation or user experience. Also, all the texts in which one cannot distinguish if the author is expressing a post-purchase experience or an evaluation, or all those texts where the author explicitly recommends some product or brand. For instance: I want the Mazda I love the clothes from Zara I advise you to buy this Bimbo bread Ambiguous texts are discarded from the gold standard. No corresponding. All the texts that contain news headlines or corporative or informative messages about the brand or product, without user’s opinions or statements. Also belong to this category all the questions where one cannot infer user experience, evaluation, or purchase intention, texts that express user experience, evaluation or purchase intention of a third person, and texts that imply the sale of the product and do not contain user experience. Some examples: Nike opens its first shop in Madrid My father bought the gasoline 1.6 gls full Land Rover car year ’99 for sale In the experiment conducted in this thesis, two experts on marketing annotated each text as belonging to one of the four Consumer Decision Journey stages (i.e. 
awareness, evaluation, purchase or post-purchase). All the annotations were then checked by one reviewer with a social sciences background and by two reviewers with a computational linguistics background. Consensus between annotators was sought during the execution of this process.

8.2.1.2 Describe Data Task

Regarding data format, the dataset used has a structure containing the text gathered, plus other metadata, and its classification. The data schema (a view of the Social Graph Ontology with the ontology elements required by this technique) is shown in Figure 8.5. The classes and properties included in the diagram have already been described in Chapter 5.

Regarding its volume, the dataset used for modelling and evaluating the technique for detecting Consumer Decision Journey stages (i.e. the gold standard) consists of 13,980 opinions written in English and 22,731 opinions written in Spanish. The length of the texts ranged from 2 to 194 words. The texts were collected from five different social media sources (forums, blogs, reviews, social networks, and microblogs) and refer to different domains: automotive industry, banking, beverages, sports, telecommunication, food, retail and utilities. The opinions were selected by looking for a set of 72 particular trademarks of the different domains (or business sectors).

Figure 8.5: Format of the data used by the technique for detecting Consumer Decision Journey stages

8.2.1.3 Explore Data Task

This task characterises the data from different viewpoints to ensure that the gold standard is rich enough for model learning purposes. Specifically, the objective of this task is to describe the distribution of the data with respect to media sources, business sectors, and Consumer Decision Journey categories.

Figure 8.6 shows the distribution of the texts along the media sources and business sectors for which the data used in our experiments were gathered, while Figure 8.7 shows the distribution of texts along the Consumer Decision Journey categories.

              Social Networks   Reviews   Microblogs   Blogs   Forums
Automotive          678            420        2488       219      673
Banking             122            647        2165       608      652
Beverages           746              7        6792       768       54
Sports              778            351        2671       498      731
Telecom.            809            661        1940       553      720
Food                  0              0        3876       371       23
Retail              248            170        3140       110       72
Utilities             9              0        1871        55        5

Figure 8.6: Distribution of the texts along the media sources and sectors for the Consumer Decision Journey gold standard

               Automotive   Banking   Beverages   Sports   Telecom.   Food   Retail   Utilities
Postpurchase        474        411         327       886       1395     340      569        411
Purchase             55         12          53       113         66      34      242        167
Evaluation          125         42          29       116         82      49       99         39
Awareness           195        146         514       138        182     293       89         60

Figure 8.7: Distribution of the texts along the Consumer Decision Journey categories

8.2.1.4 Verify Data Quality Task

This task examines the quality of the data, ensuring that the gold standard is valid enough for modelling the classifier.
Thanks to the variety of sectors selected, it was possible to have a crossdomain and cross-source perspective, being able to carry out generalisations on the linguistic rules proposed and studying the relation among different stages, product typology and number of texts produced. All the texts of the corpus were written by users of different sites and social media, thus we found a lot of grammatical errors and misspellings that supposed additional difficulties to pattern identification. Moreover, all the texts were in English and Spanish but with different geographical language varieties in both cases (e.g. American Spanish, European Spanish, American English, British English), thus some lexical units were especially hard to detect. We observed that there is a general tendency to comment or analyse the quality and features of expensive or high involvement products while cheaper ones 197 received much less feedback. Particularly, in the case of cars, mobile providers or sportive clothes and shoes (sectors Automotive Industry, Telecommunication, and Sports, respectively), we appreciated that customers tend to write more evaluative texts, investigating the pros and cons of different brands before buying them. Users are also inclined to comment their personal experiences with the product after using it. Accordingly, it is more difficult to find evaluative messages about consumer-packaged goods such as beverages or food whose cost is typically much lower. In these cases, consumers require less deliberation, show less involvement, and they usually do not compare these products with their competitors before purchasing them. However, in the case of cheaper products, consumers tend to pay much more attention to the advertising campaigns (awareness). Correspondingly, the number of comments about their post-purchase experience is also lower in this kind of products. As it can be seen in Figure 8.7, the number of texts per category is unbalanced along the different stages of the Consumer Decision Journey for the different business sectors. Despite these differences across domains, we consider the corpus varied enough for learning and evaluating the technique for identifying Consumer Decision Journey stages, since it consists in an random sample of the posts produced for the domains being monitored, and the overall volume of texts for each stage is adequate for learning and evaluation purposes. In order to estimate how reliable the annotation was, an excerpt of the classified corpus (1,000 texts) along with the annotation criteria were given to a group of annotators through the Amazon Mechanical Turk93 annotation services (see question 3 in Figure 8.8 for an example). Each text was classified by two different anonymous human annotators and compared against the annotation in the gold standard. To measure the inter-annotator agreement we chose Fleiss’ kappa metric [Fleiss, 1973], which takes the value of 1 for a perfect matching between the annotators and 0 (or a negative number) if the matching is the same as (or worse than) expected. In our case, the value for this metric was 0,503, which is generally regarded as a moderate value. 93 http://www.mturk.com 198 Figure 8.8: Example annotation of a post according to a Consumer Decision Journey category using Amazon Mechanical Turk 199 8.2.2 Modelling Activity The goal of this activity is to develop an automatic classifier for identifying Consumer Decision Journey stages within user-generated content. 
This activity consists in the ordered execution of the following tasks. 1. The Select Modelling Technique task consists in selecting an describing a modelling technique for being applied for identifying Consumer Decision Journey stages from user-generated content. 2. The Build Model task consists in implementing a rule set against which the posts will be matched in order to identify the Consumer Decision Journey stages. Next, each of these tasks are described. 8.2.2.1 Select Modelling Technique Task The goal of this technique is to perform a classification of an arbitrary content into zero or one Consumer Decision Journey stages. For doing so, this technique relies on the rule-based modelling technique described in Section 8.1.3. Therefore, the resulting classifier matches the textual content received against a rule set, outputting a set of numeric values associated to each of the four categories, meaning a value distinct from zero that the post is classified according to its corresponding Consumer Decision Journey stage. As described in Section 8.1.3, the selected rule-based classification technique may output several candidate classification categories. However, as the output of the Consumer Decision Journey classifier must consist on a unique category, the following heuristic is executed after rule matching: (a) If the text is classified into one Consumer Decision Journey stage (i.e. only one category has a value distinct from zero), then the classifier outputs such category. (b) If the text is classified into more than one Consumer Decision stage (i.e. more than one category has a value distinct from zero), then the one that 200 corresponds to the latest stage in the Consumer Decision Journey workflow (shown in Figure 2.6) is selected, discarding the rest of the classifications. (c) If the text cannot be classified into any of the stages (i.e. all the categories have a zero value associated), then the classifier finishes without returning a classification. 8.2.2.2 Build Model Task This task consists in the development of the rule set capable of recognising fragments of text from which a stage of the Consumer Decision Journey can be derived, therefore classifying the social media posts which embed such fragments of text according to the stage detected. Although this task has been mainly executed by researchers of the group Technologies of Language Resources (TRL) of the Institut Universitari de Ling¨ u´ıstica 94 Aplicada of the Universitat Pompeu Fabra , we include its description in this thesis for self-containment purposes. The result of the joint work regarding the identification of Consumer Decision Journey stages in user-generated content has been published by V´azquez et al. [2014]. A set of linguistic patterns was compiled by studying the gold standard in order to distinguish among the different stages of the Consumer Decision Journey. The developed classifier was based on the recognition of these particular linguistic expressions. Linguistic rules were built as to match the occurrence of a lemma and its synonyms or antonyms (to increase recall); the particular context where they could occur is used as a restriction. The description of the context includes morphosyntactic information as obtained with the tagger. The inclusion of morphosyntactic information allows to differentiate, for example, between “I bought” that is an expression related to postpurchase and “I’m buying” related to purchase. 
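As a schematic illustration (not the actual Freeling-based implementation, and with the rule syntax of Listing 8.1 drastically simplified to plain lemma sequences), the sketch below matches a few toy rules against a lemmatised post and applies the stage-selection heuristic of Section 8.2.2.1, by which the latest matched stage in the journey is kept.

STAGE_ORDER = ["AWARENESS", "EVALUATION", "PURCHASE", "POSTPURCHASE"]

RULES = [  # toy (lemma sequence, stage) pairs; the real rules also use part-of-speech tags
    (("laugh", "at", "a", "commercial"), "AWARENESS"),
    (("i", "will", "buy"), "PURCHASE"),
    (("i", "buy"), "POSTPURCHASE"),   # past tense is distinguished via POS tags in reality
]

def classify_cdj(lemmas):
    # lemmas: the lemmatised tokens of one post; returns a single stage or None
    scores = {stage: 0 for stage in STAGE_ORDER}
    for pattern, stage in RULES:
        for i in range(len(lemmas) - len(pattern) + 1):
            if tuple(lemmas[i:i + len(pattern)]) == pattern:
                scores[stage] += 1    # CDJ classification rules only add one unit
    matched = [stage for stage in STAGE_ORDER if scores[stage] > 0]
    return matched[-1] if matched else None   # keep the latest stage in the journey

For instance, classify_cdj(["i", "will", "buy", "a", "tablet"]) returns "PURCHASE", since only a purchase rule matches.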
Some examples of linguistic patterns for matching Consumer Decision Journey stages are given in Table 8.1. For example, the first pattern matches the gerund form of the verb “to laugh”, followed by a preposition (any), the word “a” and the lemma “commercial” at a maximum distance of one word.

94 http://www.iula.upf.edu/trl/rpresuk.htm

Language   Linguistic Pattern                    CDJ Stage
English    laugh#VBG [IN] “a” /1/ commercial     Awareness
English    wonder if ENTITY [MD] offer           Evaluation
English    i “will” buy                          Purchase
English    i call#VBD /1/ customer service       Postpurchase
Spanish    [PP1] [VA] gustar [DI] vídeo          Awareness
Spanish    estar#V IP1 buscar#V G                Evaluation
Spanish    ir#V I 1S “a” pillar [D]              Purchase
Spanish    [PP1] quedar#V I 1 con ENTITY         Postpurchase

Table 8.1: Examples of the linguistic patterns for identifying Consumer Decision Journey stages

In the development of these linguistic patterns, we started by looking for the most frequent content words, bigrams and trigrams in the texts of each stage, trying to relate them to just one of the phases, but the results were not satisfactory. On the one hand, the most frequent bigrams and trigrams did not help to clearly identify any specific stage. On the other hand, content words used individually allowed us to identify some portions of texts as belonging to one of the stages of the Consumer Decision Journey, but the recall and precision were very low. Therefore, we decided to use these lexical elements (i.e. the most frequent content words) as a starting point to build sets of more restrictive rules that included morphosyntactic features, functional words, and synonyms and antonyms.

The inclusion of morphosyntactic tags allowed us to easily differentiate, for example, between “I bought”, used in postpurchase experience, and “I will buy”, which classifies into the purchase stage. The introduction of functional words permitted us to identify more complex expressions, such as “I’m going to buy” or “thinking of buying something”. Finally, with the use of synonyms, antonyms and other meaning-related words, we could increase the recall of our system.

In order to identify the morphological variations of the tokens, we used the lemmas of the most frequent words (if we needed the exact word we put inverted commas round it). This avoided having to create a pattern for each form of the word. Additionally, we added morphosyntactic tags to specify what tense of the verb or what morphological element we wanted to identify.

Different heuristics for engineering the rules for every stage are discussed next.

Identifying Awareness. As commented in previous sections, in the texts belonging to the awareness stage authors tend to comment, criticise or talk about their experience with respect to specific advertising campaigns or promotions of the selected product or brand. Therefore, the rules that we created to identify sentences pertaining to this stage (996 for English and 65 for Spanish) mostly rely on particular lexical items belonging to the advertisement word family. Some examples are: “advertisement”, “campaign”, “promotion”, “video”, “sign”, etc. In the initial analysis of this kind of texts, we created more restrictive rules, matching longer portions of text; however, further analysis of the classifier results showed that, when using more lexicalised and less restrictive rules (with a small set of part-of-speech tags and functional words), the final results of the classifier were equal or even better.

Identifying Evaluation.
Rules designed to identify evaluative texts (440 for English and 167 for Spanish) proved more complex than those created to distinguish awareness. For this Consumer Decision Journey stage, rules are longer and contain more morphosyntactic information, although the weight of the lexical elements continues to be high. Generally, the rules of this class are more restrictive than those for awareness. Since in this step the user tends to compare products or brands, a large proportion of the rules identify comparative constructions. For example: "all the best /1/" or "more [AQ] than". There are also rules which incorporate specific vocabulary usually used to convey preference or comparisons, such as "stand out", "prefer", "recommend" and "suggest".

Identifying Purchase. For this stage we defined 1,267 rules for English and 906 rules for Spanish. Generally, users tend to write a lot of comments before and after purchasing some product, but the number of remarks about the specific moment of the transaction is low. Additionally, the number of different ways to express this specific stage is also smaller than for other stages. We identified a set of verbs, generally expressed in future tenses, whose meaning is related to "buy" or that imply a purchase: "acquire", "hunt down", "reserve", "try", "grab", etc.

Identifying Postpurchase Experience. This is the stage with the most complex rules (710 for English and 769 for Spanish). We found that there is a strong relation between the type of product and the linguistic expression of the postpurchase experience, which is ambiguous in many cases. Consequently, for this stage, we decided to build rules with a considerable amount of morphosyntactic information (to consider past tenses of the verbs, for example) and lexical elements related to postpurchase customer services (e.g. "complaint", "unsubscribe").

The rules have been defined to be used within the technique described in Section 8.1.3, and are thus expressed according to the grammar shown in Listing 8.1. The objective of the classifier is to obtain the Consumer Decision Journey category according to which a social media post can be classified. Therefore, all the rules consist of a linguistic pattern to be matched and a classification action that makes the numeric value associated with a category distinct from zero whenever the linguistic pattern is matched, meaning that the text could be classified in the given Consumer Decision Journey stage. Therefore, from all the possible numeric operations that can be modelled with the rules grammar, this task only makes use of addition operations, specifically adding one unit to the category for which a pattern has been matched. An example of a linguistic rule obtained by this activity is shown next.

about [TO] get /2/ tablet → PURCHASE + 1

8.3 Technique for Detecting Marketing Mix Attributes

In order to achieve one objective of our research, i.e. to develop a technique for automatically classifying short user-generated texts into one or more of the Marketing Mix categories, we have carried out the same activities as in the previous technique (Data Understanding, Data Preparation, and Data Modelling), which are described next.

8.3.1 Data Understanding Activity

As in the previous technique, this activity consists in the ordered execution of the tasks Collect Initial Data, Describe Data, Explore Data and Verify Data Quality, which are described next.
8.3.1.1 Collect Initial Data Task

This task applies the approach described in Section 8.1.1 for retrieving textual contents mentioning commercial brands from different social media and constructs the gold standard required for model creation and evaluation. After retrieving the corpus, this task generates the gold standard used for modelling and evaluating the Marketing Mix classifier. For doing so, human annotators are asked to tag each text according to the following instructions:

Quality. All the texts that refer to the quality, performance, or positive or negative characteristics of a product that affect its user experience. For example:
Converse are extremely uncomfortable from the moment you put them on

Design. All the texts that include a reference about specific traits or features of the product such as size, colour, packaging, presentation, and styling. For example:
Anybody notices the car? GQ's design collaboration with Citroen

Customer Service. All the texts that refer to the responsiveness and service given by companies to customers in every stage of the Consumer Decision Journey. Also texts that refer to technical and post-purchase support to current and prospective customers. For example:
@MissTtheTeacher hiya, nope, I'm not through there. I've been on at that Scottish Power mob for weeks. Their customer service is laughable

Point of Sale. All the texts that include a mention to the physical place where the product can be found and purchased. Similarly, texts that convey difficulty with finding the product in the right distribution channels such as supermarkets, stores, outlets, dealerships, and stations. For example:
About to spend mad money at this Nike store!

Promotion. All the texts that refer to marketing strategies oriented to increase demand such as contests, freebies, coupons, competitions, discounts, gifts, and offers. For example:
@Jennorocks lego promotion on at Shell garages :)

Price. Texts that refer to the cost, value or price of the product. It may also comprise texts that refer to specific price promotions such as discounts and price cuts, in which case the text should be annotated as Price and also as Promotion. This category also includes texts with numerical references to product prices. Some examples:
This Volkswagen I got my eye on is so sexy & it's an affordable price
@carllongs on lighter hearted note soreen on offer at tesco! 80p
£1.47 and four slices have holes in them?! What on earth warburtons http://t.co/S9jSKS3LMo

Sponsorship. Texts that refer to awards, competitions, teams, foundations, persons, charity fundraising, concerts and similar events which are organised, endorsed or financially supported by the company or brand. Some examples:
Breaking News Sainsbury's becomes title sponsor of the first Sport Relief Games
School event this morning was sponsored by Scottish Power. Thinking of charging an extra 10% without telling them

Advertisement. All the texts that include a reference to public, paid brand announcements or messages broadcasted in the media or placed in outdoor settings. Some examples:
These tv adverts are great aren't they, Rory "interestin" McIlroy on Santander, and best of of all Kerry Katona on pay day loans, priceless!
The lidl ad on Rte Two just now had delicious written on the screen. Surely its delicious? or is it subliminal advertising. #lidl

As in the previous technique, two experts on marketing annotated each text as belonging to one or more Marketing Mix elements (i.e.
quality, design, point of sale, customer service, sponsorship, advertisement, promotion and price), and the annotations were then checked by one reviewer with a social sciences background and by two reviewers with a computational linguistics background, seeking consensus between annotators and reviewers.

8.3.1.2 Describe Data Task

Regarding data format, the data schema used by this technique is analogous to the one used in the previous technique, but including Marketing Mix annotations instead of Consumer Decision Journey ones (see Figure 8.9). The opinions used by the experiment conducted in this thesis were the same as the ones used for the technique for detecting Consumer Decision Journey stages. Therefore, their characteristics regarding volume are the ones described in Section 8.2.1.2.

Figure 8.9: Format of the data used by the technique for detecting Marketing Mix attributes

8.3.1.3 Explore Data Task

As the corpus used for learning and evaluating the classifier used by this technique is the same used for the technique for identifying Consumer Decision Journey stages in user-generated content, its distribution across social media sources and business sectors is the same (i.e. the one shown in Figure 8.6).

8.3.1.4 Verify Data Quality Task

In the construction of the corpus we could observe the difficulty of filtering texts by whether they belong to one of the Marketing Mix categories; the great majority of the texts are irrelevant for our classification, given that just a small group of them conveys Marketing Mix elements (25% of the corpus). Nevertheless, we consider the corpus varied enough for learning and evaluating this technique, since it consists of a random sample of the posts produced for the domains being monitored, and the overall volume of texts for each category is adequate for learning and evaluation purposes.

As in the previous technique, we used Amazon Mechanical Turk for estimating annotation reliability (see question 2 in Figure 8.10 for an example). The value for Fleiss' kappa was 0.397, which is generally regarded as a fair value.

Figure 8.10: Example annotation of a post according to a Marketing Mix category using Amazon Mechanical Turk

8.3.2 Modelling Activity

The goal of this activity is to develop an automatic classifier for identifying Marketing Mix attributes within user-generated content. This activity consists in the ordered execution of the following tasks.

1. The Select Modelling Technique task consists in selecting and describing a modelling technique to be applied for identifying Marketing Mix attributes from user-generated content.

2. The Build Model task consists in implementing a machine-learning classifier that identifies the Marketing Mix attributes.

Next, each of these tasks is described.

8.3.2.1 Select Modelling Technique Task

In order to automate the classification of texts based on the Marketing Mix elements conveyed in them, this technique makes use of the Decision Tree (DT) modelling technique defined by Quinlan [1993].
Specifically, one binary classifier per Marketing Mix category is trained. Each binary classifier determines whether or not the post belongs to a given Marketing Mix category. Therefore, the classification for each category is made between the positive class (for example, Advertisement) and the negative class (for example, No Advertisement). As a given text can belong to more than one category, we built a multi-category classifier that combines all the binary classifiers in a process that iteratively identifies the set of Marketing Mix attributes expressed in each text, returning the set of Marketing Mix attributes for which the corresponding binary classifiers output a positive class.

We also tried to use classifiers based on the Logistic Regression model [le Cessie and van Houwelingen, 1992], but the results were better with the DT classifiers in terms of precision and recall. Additionally, a DT exposes the features that are relevant for classification and is therefore easily interpretable by humans. This fact made the results of these classifiers very useful for final visualisation and human consumption purposes. In order to create real-life applications in the marketing field, this is a very important feature, as it makes it possible to visually show marketing agencies the criteria followed for text classification. Moreover, the DT model can also be manually revised in order to remove terms that can appear as relevant features due to biased samples. For example, "trainer" appeared as one of the discriminative features to decide if a text belongs to the "design" category for the sports domain. With the direct visualisation we could identify and eliminate it.

8.3.2.2 Build Model Task

This task consists in applying a machine-learning technique for learning the automatic classifier for identifying Marketing Mix attributes in the content generated by consumers. For doing so, the task executes the following steps:

1. Build Learning Datasets. This step constructs individual learning datasets for each Marketing Mix category, as each individual classifier is trained with its own corpus containing positive and negative examples for a given category. In the experiment conducted in this thesis, we built a dataset with all the texts manually annotated as belonging to a given category (advertising, customer service, design, point of sale, price, promotion, quality, and sponsorship) as positive examples. For each category, we also utilised all the texts that do not belong to that given category as negative examples. The size of the datasets ranged between 85 and 1,046 texts for the positive examples.

2. Part-Of-Speech Tagging. This step consists in tokenising, lemmatising and annotating the texts with their corresponding part-of-speech tags, as described in Section 2.6.1. In our experiments, for executing this step we made use of Freeling.

3. Filter Stop-Words. This step consists in removing a list of stop-words from the list of tuples outputted by the previous step, by attending to their lemmas and part-of-speech tags. Such stop-words include not only functional words but also brands and proper nouns. The output of this step consists only of the lemmas of adjectives, verbs (with the exception of auxiliary verbs) and common nouns, considering the rest of the categories irrelevant or less important for the identification of the Marketing Mix attributes.

4. Features Vector Construction.
This step receives the filtered output of the previous step and generates a vector of features. We adopted a bag-of-words approach where the words occurring in texts are used as the features of a vector. Thus, each text is represented as the occurrence (or frequency) of the words in it. This approach embodies the intuition that the more frequent a word is in the texts of the class (i.e. the Marketing Mix element selected), the more representative it is of the content and therefore of the class.

5. Features Selection. This step applies a chi-square feature selection method in order to reduce vector dimensions by selecting the most relevant features. The idea behind this feature selection method is that the most relevant words to distinguish positive examples are those that are distributed most differently in the positive and negative class examples.

6. Model Training. This step uses the vectors previously created for learning a set of C4.5 [Quinlan, 1993] decision tree classifiers as implemented in Weka [Hall et al., 2009]. The results for the negative class are generally much better than those obtained for the positive class, due to the larger number of texts of the negative class used to train the classifiers. However, as the main objective of our work is being able to introduce this tool in a real marketing scenario, we find that it is preferable to classify a text in a negative class if the classifier does not find enough cues than to erroneously classify it in a positive class.

8.4 Technique for Detecting Emotions

In order to achieve one objective of this thesis, i.e. to develop a technique for automatically classifying short user-generated texts into one or more emotions, we have carried out the same activities as in the previous techniques, which are described next.

8.4.1 Data Understanding Activity

This activity consists in the ordered execution of the same tasks that were executed for the previous techniques. These tasks are explained next.

8.4.1.1 Collect Initial Data Task

This task applies the approach described in Section 8.1.1 for retrieving textual contents mentioning commercial brands from different social media, and constructs the gold standard required for model creation and evaluation. Several people with different background knowledge participated in this task. The gold standard is created by annotating the gathered corpus according to the conceptual framework defined in Section 2.5.3. Annotators were asked to tag each text with zero or more labels. In order to understand the sentiments involved in each category —and to help to annotate the corpus—, we have specified the secondary sentiments related to each of them. The set of sentiments is based on a reformulation of Richins [1997] and Shaver et al. [1987]; there is a list of them for each sense within a category (see Table 8.2).

In the experiment conducted in this thesis we gathered a corpus of posts written in Spanish about several commercial brands from various social media and different business/market domains. The manual annotation of the texts was carried out first by a person who annotated the resulting corpus of the Data Gathering Activity according to the conceptual framework of emotions/sentiments (see Section 2.5.3). This person followed some specific guidelines (e.g. if a secondary sentiment in Table 8.2 was identified for a text, then it was classified under its corresponding basic sentiment).
This annotation process was supervised by two more persons, who examined the annotations and discussed them with the annotator in case of disagreement. They came from different backgrounds, though in close relation to the project field: the annotator had an advertising and public relations background, one of the reviewers was an expert in social sciences, and the other one was from the computational side.

Primary         | Secondary Sentiments
Trust           | Optimism, Hope, Security
Satisfaction    | Fulfilment, Contentment
Happiness       | Joy, Gladness, Enjoyment, Delight, Amusement; Joviality, Enthusiasm, Jubilation; Pride, Triumph
Love            | Passion, Excitement, Euphoria, Ecstasy
Fear            | Nervousness, Alarm, Anxiety, Tenseness, Apprehension, Worry; Shock, Fright, Terror, Panic, Hysteria, Mortification
Dissatisfaction | Dislike, Rejection, Revulsion, Disgust; Irritation, Aggravation, Exasperation, Frustration, Annoyance
Sadness         | Depression, Defeat, Unhappiness, Anguish, Sorrow, Agony; Melancholy; Disappointment, Hopelessness, Dejection; Shame, Humiliation, Guilt, Regret, Remorse; Alienation, Isolation, Loneliness, Insecurity
Hate            | Rage, Fury, Wrath, Hostility, Ferocity; Bitterness, Resentment, Spite, Contempt, Vengefulness; Envy, Jealousy

Table 8.2: Primary and secondary sentiments

8.4.1.2 Describe Data Task

Regarding data format, the data schema used by this technique (see Figure 8.11) is analogous to the one used in the previous techniques, but including emotion annotations instead of Consumer Decision Journey or Marketing Mix ones.

Regarding volume and other gross attributes of the texts gathered, the corpus we have used in our experiments is made up of 26,505 texts (709,095 words) in Spanish taken from different channels including blogs, forums, microblogs (specifically, from Twitter), product review sites, and social networks (specifically, from Facebook). These texts are related to several brands belonging to nine business sectors. The choice of brands is based on their relevance for the media agency that participated in this work, Havas Media Group95, and on the number of opinions that they generate according to their social media monitoring tools. These domains also constitute a representative set of both low-involvement and high-involvement products —i.e. products which are bought frequently and with a minimum of thought and effort (e.g. soft drinks) and products for which the buyer is prepared to spend considerable time and effort (e.g. cars)—, as well as of products with different cost.

95 http://www.havasmg.com

Figure 8.11: Format of the data used by the technique for detecting emotions

Social Media Type | Distribution of texts
Blogs             | 19%
Forums            | 18%
Microblogs        | 39%
Review sites      | 10%
Social Networks   | 14%

Table 8.3: Distribution of texts for the sentiment corpus by social media type

8.4.1.3 Explore Data Task

This task characterises the data from different viewpoints to ensure that the gold standard is rich enough for model learning purposes.
Specifically, the objective of this task is to describe the distribution of the data with respect to media sources, business sectors, and emotion categories. The distributions of the texts in the gold standard by social media type and by business sector are shown in Tables 8.3 and 8.4, respectively. According to the resulting annotation, only 27% of the texts could be said to express a sentiment (14% expressed satisfaction, 13% expressed dissatisfaction, 1% expressed trust, 1% expressed fear, 1% expressed happiness, 0.5% expressed sadness, 2% expressed love, and 3% expressed hate)96. The remaining 73% was annotated as neutral regarding sentiments.

96 The reason why the addition of these percentages is over 27% is the subsumption by SD.

Domain                     | Number of brands | Distribution of texts
Foods                      | 4                | 7%
Automotive industry        | 10               | 10%
Financial services         | 10               | 11%
Drinks                     | 3                | 24%
Cosmetics                  | 6                | 7%
Sports                     | 2                | 12%
Insurance companies        | 12               | 11%
Telecommunication services | 11               | 10%
Tourism                    | 7                | 8%

Table 8.4: Distribution of texts for the sentiment corpus by domain

8.4.1.4 Verify Data Quality Task

An excerpt of the classified corpus (300 texts) along with the annotation criteria was given to a new annotator. This allowed us to estimate how reliable the manual annotation was. To measure the inter-annotator agreement we chose Cohen's kappa metric [Cohen, 1960], which takes the value of 1 for a perfect matching between annotators and 0 (or a negative number) if the matching is the same as (or worse than) expected by chance. In our case, the value for this metric was 0.511, which is generally regarded as a moderate value.

Additionally, another excerpt of the classified corpus (1,000 texts) along with the annotation criteria was given to a group of annotators through the Amazon Mechanical Turk annotation services (see question 1 in Figure 8.12), as we did in the previous techniques. Each text was classified by two different anonymous human annotators and compared against the annotation in the gold standard. To measure the inter-annotator agreement we chose Fleiss' kappa metric (while Cohen's metric evaluates the agreement between two annotators, Fleiss' metric lets us evaluate the agreement for more annotators). The value for this metric was 0.415, which is also regarded as a moderate value.

Figure 8.12: Example annotation of a post according to an emotion category using Amazon Mechanical Turk

8.4.2 Modelling Activity

The goal of this activity is to develop an automatic classifier for identifying emotions within user-generated content. This activity consists in the ordered execution of the following tasks.

1. The Select Modelling Technique task consists in selecting and describing a modelling technique to be applied for identifying emotions within user-generated content.

2. The Generate Test Design task consists in generating a mechanism to test the model for quality and validity.

3. The Build Model task consists in implementing a rule set against which the posts will be matched in order to identify the emotion categories.

Next, each of these tasks is described.

8.4.2.1 Select Modelling Technique Task

The goal of this technique is to perform a classification of an arbitrary content into zero or more emotion categories. For doing so, this technique relies on the rule-based modelling technique described in Section 8.1.3.
Therefore, the resulting classifier matches the textual content received against a rule set, outputting a numeric value associated with each of the four sentiment polarities, where a value greater than zero means that the post is classified in the positive category corresponding to a given polarity, and a value lower than zero means that the post is classified in the negative category for that polarity. Then, the numerical values are discretised to obtain the specific sentiment categories in which the text has been classified, i.e. a positive value corresponds to the positive emotion of a category and a negative value to the negative one (see Table 2.4). If the value of a category is 0, the text is neutral with respect to that category.

Domain                     | Texts in the training set | Texts in the evaluation set
Foods                      | 592                       | 995
Automotive industry        | 86                        | 2,657
Financial services         | 411                       | 1,214
Drinks                     | 284                       | 2,106
Cosmetics                  | 572                       | 828
Sports                     | 451                       | 2,892
Insurance companies        | 334                       | 1,050
Telecommunication services | 460                       | 2,601
Tourism                    | 408                       | 999

Table 8.5: Distribution of texts for the sentiment corpus for the training and test sets by domain

8.4.2.2 Generate Test Design Task

In the experiment performed in this thesis the annotated corpus was used to train and evaluate the system. The training set used to create the rules contained a sample of 80% of the texts annotated with a sentiment in the corpus (i.e. 13% of the whole corpus), while the evaluation set contained a sample of 58% of the corpus; both samples were made up of randomly-chosen texts. Table 8.5 shows, for each domain, the number of texts that have been considered in the training and evaluation sets. Finally, the quality measures used for evaluating the classifier are the ones described in the evaluation section (see Section 8.7.2.3).

8.4.2.3 Build Model Task

The goal of this task is to learn a classifier for analysing the sentiment of user-generated content. For doing so, this task engineers a set of rules capable of recognising fragments of text from which a consumer emotion can be derived, therefore classifying the social media posts which embed such fragments of text according to the emotion detected.

This task has been mainly executed by a team of the Ontology Engineering Group of the Universidad Politécnica de Madrid97, in which the author of this thesis was not involved. However, the description of this task is included for self-containment purposes. The result of the joint work regarding the identification of emotions in user-generated content has been published by Aguado de Cea et al. [2014].

The classification rules were compiled by studying the gold standard, as well as by reusing two existing linguistic resources: Badele3000 [Bernardos and Barrios, 2008] and Calíope [Aguado de Cea and Bernardos, 2007]. Such resources are described next.

Badele3000. Badele3000 is a domain-independent lexical-semantic database with information about the 3,300 most frequent nouns in Spanish. The theoretical linguistic foundation of this resource is the Meaning-Text Theory (MTT) [Mel'čuk, 1996], especially the concept of Lexical Function (LF), which relates two lexical units (the base and a certain value of the LF for that base), accounting for the paradigmatic relations and the syntagmatic relations (or collocations98) between those lexical units. For example, if the base is "rain", the relation of intensification is expressed by "heavy", i.e.
its magnified (intensified) form is Magn(rain) = heavy, while the magnified value of "wind" is Magn(wind) = strong. These data let us know that rain goes with "heavy" but wind goes with "strong", and that these are typical collocations of the English language to express that rain and wind are intense. The database contains more than 20,000 linguistic collocations. Additionally, lexical units are organised in a hierarchical structure in which each lexical unit is classified according to a semantic label (SL) hierarchy, which usually corresponds to the hyperonym or immediate generic term. A lexical unit 'inherits' the values of the LFs defined for the SL under which it is classified.

97 http://www.oeg-upm.net
98 A collocation is a partly or fully fixed sequence of words established through repeated use.

Regarding those lexical units corresponding to sentiments, in Badele3000 they are classified under the semantic label sentimiento (sentiment)99 or one of its children: sentimiento positivo (positive sentiment) and sentimiento negativo (negative sentiment) (see Table 8.6).

Semantic Label                            | Lemma
Sentimiento (sentiment)                   | Deseo (wish), Ansiedad (anxiety), Sorpresa (surprise)
Sentimiento positivo (positive sentiment) | Amor (love), Felicidad (happiness), Satisfacción (satisfaction), Seguridad (security)
Sentimiento negativo (negative sentiment) | Dolor (pain), Pena (sadness), Desesperación (desperation), Miedo (fear), Sufrimiento (suffering), Odio (hatred/hate), Inseguridad (insecurity)

Table 8.6: Excerpt from sentiments in Badele3000

Therefore, the next step was to obtain those lexical units (verbs, adjectives, etc.) that are values of the LFs for the SL sentimiento, its children ("positive sentiment" and "negative sentiment") and its grandchildren (the nouns for sentiments). So, for example, we obtained verbs such as embargar (be overwhelmed by) expressing that a sentiment "exists (affecting someone)" —in terms of LFs, Func1(sentimiento) = embargar (a alguien)—, and we could infer that it also combined with the lexical units corresponding to sentiments such as tristeza (sadness), emoción (emotion), alegría (happiness), etc. We also obtained collocates which are specific to particular sentiments, but cannot be used with other sentiments. For instance, "apoderarse (de alguien)" (to be possessed by) can be used with miedo (fear), but not with alegría (happiness). In this way, we automatically obtained a list of collocates of Spanish nouns for sentiments, which we could directly reuse in the creation of our rules.

99 The translations of the example into English have been made for the sake of clarity.

Calíope. Calíope is a web application designed to help in learning contextualised terms in English and Spanish by, first, providing examples of their use in context and, second, showing the lexical-semantic relationships among them. For these purposes, it manages two resources: a corpus for Spanish and another one for English, as well as a glossary of terms for both languages. Among all Calíope's functionalities, the ones that are noteworthy for our work are the following:

• Addition of new texts to the corpus. This allowed us to include our corpus in Calíope, which facilitated the retrieval of the vocabulary on sentiments.

• Filtering of texts. This let us choose the texts we wanted to analyse.

• Frequency of words. This facility and the part-of-speech annotation helped us to establish the most relevant words by grammatical category.
We used this result as one of the starting points for creating the rules.

• Concordances of a term —i.e. occurrences of a term in the texts— and co-occurrences of several terms (which are not necessarily adjacent). These functionalities provided us with the contexts of the terms we needed to examine in order to draw patterns/templates for the antecedents of our rules.

The training set analysed to create the rules contained a randomly chosen sample of 80% of the texts annotated with a sentiment in the corpus (i.e. 13% of the gold standard). However, as explained before, the annotated corpus was not the only source used to create the rules; they were also based on the set of collocations of common sentiments obtained from Badele3000 and on the semantic relations (reflected by the LFs) existing between them. This information was very valuable because it helped us to derive expressions in the antecedents of the rules and the sentiment category in their corresponding consequents.

Table 8.7 shows some rules created for the Love-Hate (LH) polarity. They were written after having analysed the concordances of "odio" (hate/hatred), found in the corpus via Calíope, and its collocations, retrieved from Badele3000 (see Table 8.8).

Meaning in Spanish                                  | Meaning in English                                                                               | Rules
mi/este odio a/por marca                            | my/this hatred against/for brand                                                                 | [D] odio#NC [SP] ENTITY → LH - 1
siento odio a/por marca                             | I feel hatred against brand                                                                      | sentir#V odio#NC [SP] ENTITY → LH - 1
(cómo/cada día) odio (más) a (el/la/esta/...) marca | I feel an increasing/growing hatred against/for brand; What a hatred I feel against/for brand   | odiar#V a#SP /1/ ENTITY → LH - 1; odiar#V más#RG a#SP /1/ ENTITY → LH - 2; cómo odiar#V a#SP /1/ ENTITY → LH - 2
marca es (muy/tan/...) odiosa                       | brand is (very/so/...) hateful                                                                   | ENTITY ser#V odioso#A → LH - 1; ENTITY ser#V muy#RG odioso#A → LH - 2

Table 8.7: Examples of rules for classifying emotions

Lexical Function | Semantic Relation reflected by the LF           | Value
FinFunc0         | Dejar de existir (L) (to stop existing)         | Desaparecer (to vanish)
IncepFunc0       | Empezar a existir (L) (to start existing)       | Emanar (to arise)
IncepFunc0       | Empezar a existir (L)                           | Nacer (to arise)
Func1            | Afectar a algo/alguien (L) (to affect sth/sb)   | Anidar (en algo/alguien) (to nest)
Func1            | Afectar a algo/alguien (L)                      | Palpitar (en alguien) (to beat)
Func1            | Afectar a algo/alguien (L)                      | Latir (en alguien) (to beat)
Func1            | Afectar a algo/alguien (L)                      | Embargar (a alguien) (to be overwhelmed by)
IncepPredMinus   | Disminuir (L) (to decrease)                     | Disminuir (to decrease)
IncepPredPlus    | Aumentar (L) (to increase)                      | Aumentar (to increase)
Manif            | Mostrar (L) (to show)                           | Mostrar (to show)
Oper1            | Hacer (L) (to do)                               | Sentir (to feel)
Oper1            | Hacer (L)                                       | Tener (to feel)
Real1-M          | Hacer lo esperable (con L) (to do the expected) | Ocultar (to conceal)
Real1-M          | Hacer lo esperable (con L)                      | Disimular (to disguise)

Table 8.8: Collocations of "odio" in Badele3000

8.5 Technique for Detecting Place of Residence

The goal of this technique is to identify the place of residence of users, defining "place of residence of a user" as the geographical location where a user usually lives. To achieve this goal we have carried out the same activities as with the previous technique, which are described next.

8.5.1 Data Understanding Activity

This activity consists in the ordered execution of the Collect Initial Data, Describe Data and Explore Data tasks. Next we explain each of these tasks.
8.5.1.1 Collect Initial Data Task

We have collected a corpus of users extracted from Twitter whose place of residence was known beforehand. For each user, we have extracted the location and description declared in his/her profile, his/her timeline (i.e. tweets and retweets), as well as the list of followers and users followed by the user. Additionally, we have extracted the locations, descriptions and timelines of each user included in the lists of followers and followed. We have restricted the number of friends for each user to 20 (10 followers plus 10 persons followed by the user to be characterised), since Twitter limits the number of calls to its API. Additionally, we have restricted the number of tweets analysed to 20, for the same reason, including tweets authored by the user and retweets.

8.5.1.2 Describe Data Task

The dataset used has a structure containing data about 1,080 users, the content shared and published by them, and the existing relationships among them and other users. The data format also relates each user with a normalised geographical location that represents her/his place of residence, defining a gold standard. Such location is defined at the level of city and related with its administrative region of second level (e.g. county, province), the administrative region of first level (e.g. state, autonomous community), and the corresponding country. Additionally, the data format relates the contents with the named entities of type location extracted from them. The data schema is shown in Figure 8.13. The classes and properties included in the diagram have already been described in Chapter 5.

Figure 8.13: Data format of the corpus used by the technique for detecting the place of residence of social media users

8.5.1.3 Explore Data Task

The users in the evaluation set are distributed among 11 different countries (Argentina, Chile, Colombia, Spain, USA, Japan, Mexico, South Africa, Switzerland, Uruguay and Venezuela). Such users share and publish content in different languages (mainly in Spanish and English).

8.5.2 Data Preparation Activity

During this activity, we have pre-processed the contents published by the users, as well as their descriptions in their profiles, by applying the common tasks defined in Section 8.1.2. Nevertheless, we have not cleansed posts referring to particular brands during the Clean Data task, as we consider all the content relevant for extracting locations from them.

8.5.3 Modelling Activity

The goal of this activity is to develop an automatic classifier for detecting the place of residence of social media users. This activity consists in the ordered execution of the following tasks:

1. The Select Modelling Technique task consists in selecting and describing a modelling technique to be applied for creating the classifier.

2. The Generate Test Design task consists in generating a mechanism to test the model for quality and validity.

Next, each of these tasks is described.

8.5.3.1 Select Modelling Technique Task

We have experimented with five different approaches for detecting the place of residence of a given social media user. Such approaches are summarised next.
1. Use the metadata about the locations of users included in their profiles in social networks.

2. Analyse the friendship networks of the users for inferring their place of residence when it cannot be retrieved from location metadata.

3. Perform text mining of the descriptions written by users about themselves in their profiles for inferring their place of residence when it cannot be retrieved from location metadata.

4. Perform text mining of the content published and shared by social media users for inferring their place of residence when it cannot be retrieved from location and description metadata.

5. Combine the previous approach with the approach based on friendship networks into a content-based and network-based hybrid approach.

Next we explain every approach.

Approach based on metadata about locations of users. This approach corresponds to the one implemented by Mislove et al. [2011]. The approach makes use of the location metadata in the user profile, as for example, the location attribute returned by the Twitter API when querying user details100. Figure 8.14 shows the location attribute in an example Twitter user profile. Users may express their location in different forms through this attribute, such as geographical coordinates, or the name of a location (e.g. a city, a country, a province, etc.). Therefore, a normalisation stage is required in order to obtain a standard form for each location.

100 http://dev.twitter.com/docs/api/1.1/get/users/show

Figure 8.14: Example of user profile location metadata

For normalising the location, this approach makes use of a geocoding API. Our implementation uses the Google Maps web services. This approach invokes a method of the geocoding API that analyses a location and returns a normalised tuple composed of a set of components that define the location, including latitude, longitude, locality, and country, among others. For example, if the request "santiago" is sent to the web service, the response will be a tuple containing "Chile" as the country and "Santiago" as the locality, among other location components. The complete list of components is given in the API documentation101. Please note that this query does not provide enough information for disambiguating locations, e.g. "santiago" may refer to many geographical locations, including Santiago de Chile and Santiago de Compostela (Spain). Therefore, the precision of this approach depends on how users describe their location when filling in their profiles. For example, geographical coordinates will define locations accurately, while combinations of city and country (e.g. "Guadalajara, Spain") will enhance disambiguation (although not completely). In addition, this approach does not return a place of residence when users have not filled in the location field contained in the user profile form of the social network. The approaches described next deal with these precision and coverage issues. Figure 8.15 shows an example output of the Google Geocoding API, while Listing 8.2 formalises the step executed by this approach.

101 http://developers.google.com/maps/documentation/geocoding

Figure 8.15: Example of an output of the Google Geocoding API

function ResidenceFromLocationData(user)
begin
    return GeoCode(location(user))
end

Listing 8.2: Approach based on metadata about locations of users

Approach based on friendship networks. This approach exploits the inherent homophily of social networks [McPherson et al., 2001] for obtaining the place of residence of users. Listing 8.3 summarises the steps executed by this approach, which are described next.

1   function ResidenceFromFriends(u)
2   begin
3       l ⇐ ResidenceFromLocationData(u)
4       if l = ∅ then
5           L ⇐ ∅
6           for each f in friends(u) do
7               L ⇐ L ∪ {GeoCode(location(f))}
8           end for
9           l ⇐ MostFrequentLocation(L)
10      end if
11      return l
12  end

Listing 8.3: Approach based on friendship networks

1. Firstly, we execute the previous approach for obtaining the place of residence of a given user. If a result is obtained, the process finishes. If not, the steps described next are executed (line 3).

2. Secondly, the friends of the user in her online community are collected. After that, the location of each friend is obtained by using the geocoding API. The normalised locations obtained are appended to a list (lines 6-8).

3. Finally, the list obtained in the previous step is filtered iteratively, selecting on each iteration the locations that contain the most frequent value for a given location component, starting from the country and finishing at the city, until there is only one location in the set. First the locations whose country is the most frequent are selected, then the locations whose first-order civil entity (e.g. a state in the USA or an autonomous community in Spain) is the most frequent, and so forth. The location that remains in the list after completing the iterations is selected as the place of residence of the user. This approach ensures that the most frequent regions in the friendship network of the user are selected (line 9). Figure 8.16 shows an example of this process.
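To illustrate this filtering step, the sketch below shows a hypothetical implementation of MostFrequentLocation. It assumes, for simplicity, that each geocoded location has been reduced to a (country, first-order region, city) tuple, whereas the actual approach works on the full set of components returned by the geocoding API; the function and variable names are assumptions made for the example.

from collections import Counter
from typing import Optional, Sequence, Tuple

# A geocoded location reduced to three components, from coarsest to finest.
Location = Tuple[str, str, str]  # (country, first-order region, city)


def most_frequent_location(locations: Sequence[Location]) -> Optional[Location]:
    """Iteratively keep the locations whose component value is the most frequent,
    from country down to city, and return one of the remaining locations."""
    candidates = [loc for loc in locations if loc]
    for level in range(3):                       # country, then region, then city
        if len(candidates) <= 1:
            break
        counts = Counter(loc[level] for loc in candidates)
        winner, _ = counts.most_common(1)[0]
        candidates = [loc for loc in candidates if loc[level] == winner]
    return candidates[0] if candidates else None


if __name__ == "__main__":
    friends = [
        ("Chile", "Región Metropolitana", "Santiago"),
        ("Chile", "Región Metropolitana", "Santiago"),
        ("Spain", "Galicia", "Santiago de Compostela"),
    ]
    print(most_frequent_location(friends))
    # ('Chile', 'Región Metropolitana', 'Santiago')

The coarse-to-fine majority vote is what makes the selection robust to friends who live in homonymous cities located in different countries.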
Listing 8.3 summarises the steps executed by 228 1 2 3 4 5 6 7 8 9 10 11 12 function ResidenceF romF riends(u) begin l ⇐ ResidenceF romLocationData(u) if l = ∅ then L⇐∅ for each f in f riends(u) do L ⇐ L ∪ {GeoCode(location(f ))} end for l ⇐ M ostF requentLocation(L) end if return l end Listing 8.3: Approach based on friendship networks this approach, which are described next. 1. Firstly, we execute the previous approach for obtaining the place of residence of a given user. If a result is obtained, the process finishes. If not, the steps described next are executed (line 3). 2. Secondly, the friends of the user in her online community are collected. After that, the location of each friend is obtained by using the geocoding API. The normalised locations obtained are appended to a list (lines 6-8). 3. Finally, the list obtained in the previous step is filtered iteratively selecting on each iteration the locations that contain the value with the most frequency for a given location component, starting from the country and finishing in the city, until there is only one location in the set. First the locations whose country is the most frequent are selected, then the locations whose first-order civil entity (e.g. a state in USA or an autonomous community in Spain) is the most frequent, and so forth. The location that remains in the list after completing the iterations is selected as the place of residence of the user. This approach ensures that the most frequent regions in the friendship network of the user are selected (line 9). Figure 8.16 shows an example of this process. 229 Figure 8.16: Example execution of table location filtering process Approach based in descriptions about users. This approach exploits the description published by users about themselves in their profiles for obtaining their place of residence, as for example, the description attribute returned by the Twitter API when querying a user profile. Listing 8.4 summarises the steps executed by this approach, which are described next. 1. Firstly, we execute the first approach (approach based on metadata about locations of users). If a result is obtained, the process finishes. Otherwise, the steps described next are executed (line 3). 2. Secondly, we obtain the user self-description attribute. Such attribute usually consists on a sentence that has to be processed for extracting the geographical locations mentioned in the text (line 5). Figure 8.17 shows the self-description attribute in an example Twitter user profile. 3. After obtaining the description of the user, we perform an entity detection and classification process, by using an entity recognition and 230 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 function ResidenceF romDescription(u) begin l ⇐ ResidenceF romLocationData(u) if l = ∅ then desc ⇐ description(u) E ⇐ N amedEntities(desc, language(desc)) L⇐∅ for each entity in E do if isLocation(entity) then L ⇐ L ∪ {GeoCode(entity)} end if end for l ⇐ M ostF requentLocation(L) end if return l end Listing 8.4: Approach based in descriptions about users identification component for the language detected by the Construct Data Task for the user’s description (line 6). For doing so, we make use of Freeling, which provides an entity recognition and classification module for English, Spanish, Galician and Portuguese. Such module also implements multi-word detection, which allows recognising locations named by multiple words (e.g. “United Kingdom”). 4. 
4. After that, we filter the named entities obtained in the previous step, taking only the entities that correspond to a location. Such entities are sent one by one to the geocoding API for obtaining a set of normalised locations (lines 8-12).

5. As several locations may be obtained in the previous step due to multiple named entities contained in the description, once the normalised locations have been obtained, we select only one location by following the same selection approach described in step 3 of the approach explained previously, returning one location as the place of residence of the user (line 13).

Figure 8.17: Example of user profile description metadata

Approach based on content. This approach consists in mining the contents published (e.g. tweets) and shared (e.g. retweets) by users to obtain their place of residence. As performed by Cheng et al. [2010], this approach extracts the location named entities from the user-generated content. Listing 8.5 summarises the steps executed by this approach, which are described next.

1   function ResidenceFromPosts(u)
2   begin
3       l ⇐ ResidenceFromDescription(u)
4       if l = ∅ then
5           L ⇐ ∅
6           for each text in publications(u) do
7               E ⇐ NamedEntities(text, language(text))
8               for each ent in E do
9                   if isLocation(ent) then
10                      L ⇐ L ∪ {GeoCode(ent)}
11                  end if
12              end for
13          end for
14          l ⇐ MostFrequentLocation(L)
15      end if
16      return l
17  end

Listing 8.5: Approach based on content

1. Firstly, we attempt to execute the previous approach to obtain a location from the user profile metadata (line 3). If a result is obtained, the process finishes with a location. Otherwise, the process continues in the following step.

2. If the previous steps do not return a location, we obtain the textual contents published and shared by the user. We process each document, obtaining a list of normalised locations mentioned in the content shared and produced by the user, by applying the same entity recognition technique as in the previous approach (lines 6-13). Figure 8.18 shows an example extraction of the locations contained in the content published by a Twitter user.

3. Finally, we select the place of residence of the user from the list of locations obtained in the previous step, by applying the same location selection criteria used for the previously described approaches (line 14).

Hybrid approach. This approach combines the previous ones. Listing 8.6 summarises the steps executed by this approach.
8.6.1 Data Understanding Activity This activity consists in the ordered execution of the Collect Initial Data, Describe Data and Explore Data tasks, which are explained next. 8.6.1.1 Collect Initial Data Task We have collected a random sample consisting on authors who have written a tweet in Spanish, as well as tweets that mention those authors between 29th May 2012 and 27th March 2013, by using the Twitter API. A subset of the users collected has been manually annotated by hand with their corresponding gender by a human annotator to create a gold standard. Additionally this technique makes use of two lists of first names that have been previously classified by gender (one list for male names, and one list for female names). These lists have been extracted from a dataset published by the Spanish National Institute of Statistics. 102 http://www.acceso.com 235 cd Data Format for the Gender Identification Technique foaf:Person foaf:givenName foaf:gender sioc:account_of * * sioc:UserAccount dcterms:references * foaf:nick * sioc:Post sioc:content (language tagged) Figure 8.19: Data format of the corpus used by the technique for detecting the gender of social media users 8.6.1.2 Describe Data Task The dataset used has a structure containing users annotated with their first names and gender, as well as contents that mention them. The data schema is shown in Figure 8.19. The classes and properties included in the diagram have been already described in Chapter 5. 8.6.1.3 Explore Data Task The dataset of users and tweets contains 69,261 users, and their corresponding tweets written in Spanish, from which 1,509 users have been annotated with their gender in the gold standard. The gold standard includes 558 female users, 621 male users and 330 neutral users. Neutral users are those accounts that belong to an organisation of another kind of non-human agent. The lists of male and female names contain 18,697 and 19,817 first names, respectively. 8.6.2 Data Preparation Activity During this activity, we have pre-processed the contents included in the gold standard by applying the common tasks defined in Section 8.1.2. Specifically, we have performed language identification for filtering users that do not have tweets written in Spanish, and we have performed content normalisation. We do not have cleansed posts referring to particular brands during the Clean Data task, as we consider all the content relevant for extracting mentions to users from them. In addition, the lists of male and female names have been curated creating a gender dictionary, so unisex names have been excluded for classification purposes, 236 given the ambiguity that they introduce. After the curation process (removing the first names that appear in both lists) the male first names list is reduced to 18,391 entries and the female names list to 19,511. Some examples of removed first names are “Pau”, “Loreto” and “Reyes”, as they are valid for both males and females in Spain. 8.6.3 Modelling Activity The goal of this activity is to develop an automatic classifier for detecting the gender of social media users. This activity consists in the ordered execution of the following tasks: 1. The Select Modelling Technique task consists in selecting and describing a modelling technique for being applied for creating the classifier. 2. The Generate Test Design task consists in generating a mechanism to test the model for quality and validity. Next, each of these tasks are described. 
8.6.3.1 Select Modelling Technique Task We have experimented with two different approaches for detecting the gender of a given social media user. Such approaches are summarised next. 1. Look for the names declared in users’ profiles within dictionaries that associate first names with their corresponding genders. 2. Exploit the linguistic gender concord that occurs in the Spanish language when a name is not declared in the user profile. Next we describe each approach. Approach based in metadata about users. This approach exploits publicly available metadata associated with the user profile. Such metadata may include the user name, as for example, the name and the screen name 237 Figure 8.20: Example of user profile name metadata Twitter attributes. Figure 8.20 shows the name attribute in an example Twitter user profile. The approach makes use of the gender dictionary created in the Data Preparation Activity (see Section 8.6.2). Given a user account, its name metadata is scanned within the dictionaries and, if a match is found, we propose the gender associated to the dictionary where the first name has been found as the gender of the user. Regarding multilingualism, the gender dictionary is a language-dependent resource. However, there are many resources in the Web readily available for populating easily new dictionaries, such as the population censuses published as open data by many countries. Approach based in content. This approach exploits the information provided by mentions to users. For example, in the following tweet I’m going to visit to my uncle @Daureos to Florida, the author is providing explicit information about the gender of the user mentioned. We know that @Daureos is male because of the word “uncle” 238 written before the user identifier. The same happens in English with other family relationships, such as mother or father. We propose an approach for the Spanish language that performs a dependency parsing of the text with the aim of determining the gender of the terms related with the user mentioned. Therefore, for each tweet in which the user is mentioned, we attempt to estimate the gender of the user. Note that not all mentions to users provide information for estimating their genders (e.g. “via @user” and “/cc @user” at the end of the tweet). The dependency parser used is TXALA [Atserias et al., 2005]. The steps executed by this technique are the following: 1. Firstly, we execute the technique based on user name metadata described previously. If a gender is obtained, the process finishes. 2. If a gender is not identified in the previous step, we obtain all the posts that mention the user. 3. For each post, we perform a dependency parsing. Figure 8.21 shows the dependency tree obtained from a tweet that mentions a given user. Once obtained the dependency tree, we assign a gender to the user for the post analysed according to the following heuristics: (a) If the gender of the term in the parent node of the branch where the user is mentioned is male or female, we consider that the user is male or female accordingly (e.g. “Mi t´ıo Daureos”). (b) If some of the child nodes of the node corresponding to the user mention corresponds to a term with a specific gender, we consider that the gender of the user corresponds to the gender of such terms (e.g. “Vio a Daureos enfermo y triste”); (c) If there is a noun adjunct as the predicate of an attributive sentence where the user is the subject, we assign the gender of the noun adjunct as the gender of the user (e.g. 
“Daureos es trabajador”). 4. Finally, we select the gender that is associated the most to the post analysed for the user being analysed. 239 func: top synt: sn form: Felicidades lemma: felicidades tag: NP00SP0 func: sp-mod synt: grup-sp form: a lemma: a tag: SPS00 func: sn-mod synt: sn form: CM_de_El_Corte_Inglés lemma: cm_de_el_corte_inglés tag: NP00V00 func: obj-prep synt: sn form: cuñado lemma: cuñado tag: NCMS000 func: espec synt: espec-ms form: mi lemma: mi tag: DP1CSS func: adj-mod synt: s-a-ms form: nuevo lemma: nuevo tag: AQ0MS0 func: term synt: F-term form: . lemma: . tag: Fp func: sn-mod synt: w-ms form: Calamonte lemma: calamonte tag: NP00SP0 Figure 8.21: Dependency tree obtained from a tweet that mentions to a user 8.6.3.2 Generate Test Design Task As the technique that we propose does not perform learning from data, the whole dataset has been used for evaluation purposes. As described in Section 8.6.1.1 the whole dataset is used for measuring the coverage of the technique (i.e. the proportion of users that can be annotated with a gender), and a subset that has been manually annotated with gender is used for measuring the precision and recall. 240 8.7 Evaluation This section evaluates the techniques for the segmentation of consumers from content presented in this chapter. Section 8.7.1 describes the metrics used for evaluating the techniques, while Section 8.7.2 present the evaluations results. 8.7.1 Evaluation Metrics For evaluating the techniques for segmentation of consumers from social media content, we made use of a set of metrics commonly used in machine learning for evaluating supervised classifiers. In this context: • T P is the number of true positive decisions. It indicates the number of instances that have been classified as belonging to a particular class, and actually belong to such class. • T N is the number of false positive decisions. It indicates the number of instances that have not been classified as belonging to a particular class, and actually do not belong to such class. • F P is the number of false positive decisions. It indicates the number of instances that have been classified as belonging to a particular class, and actually do not belong to such class. • F N is the number of false negative decisions. It indicates the number of instances that have not been classified as belonging to a particular class, and actually belong to such class. Taking into account the T P , T N , F P , and F N indicators, the metrics used for evaluating the performance of the technique for unique user identification are described next. 241 8.7.1.1 Accuracy The Accuracy metric [Kohavi and Provost, 1998] measures the percentage of correct decisions. Equation 8.1 shows its definition. RI = TP + TN TP + FP + TN + FN (8.1) The range of this metric is [0..1]. We consider satisfactory values for this metric those that are over 0.85. 8.7.1.2 Recall The Recall metric [Kowalski, 1997] (a.k.a. sensitivity or hit rate) is the true positive rate. Equation 8.2 shows its definition. The range of this metric is [0..1]. For the evaluations of this section, we consider satisfactory values for this metric those that are over 0.30. Recall = 8.7.1.3 TP TP + FN (8.2) Precision The Precision metric [Kowalski, 1997] is defined as the positive predictive value. Equation 8.3 shows its definition. P recision = TP TP + FP (8.3) The range of this metric is [0..1]. For the evaluations of this section, we consider satisfactory values for this metric those that are over 0.65. 
8.7.1.4 F-measure

The F-measure metric [Larsen and Aone, 1999] combines the precision and recall metrics, offering an overall vision of how the technique behaves. It is defined as the harmonic mean of precision and recall. Equation 8.4 shows its definition.

$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$   (8.4)

The range of this metric is [0..1]. For the evaluations of this section, we consider satisfactory values for this metric those that are over 0.41, taking into account the minimum Precision and Recall satisfactory values.

8.7.2 Evaluation Results

This section presents the results of the evaluations performed on the techniques described in this chapter. The section is structured as follows:

• Section 8.7.2.1 presents the evaluation results obtained for the technique for detecting Consumer Decision Journey stages.
• Section 8.7.2.2 presents the evaluation results obtained for the technique for detecting Marketing Mix attributes.
• Section 8.7.2.3 presents the evaluation results obtained for the technique for detecting emotions.
• Section 8.7.2.4 presents the evaluation results obtained for the technique for detecting the place of residence of social media users.
• Section 8.7.2.5 presents the evaluation results obtained for the technique for detecting the gender of social media users.

8.7.2.1 Technique for Detecting Consumer Decision Journey Stages

We have evaluated our technique for detecting Consumer Decision Journey stages from user-generated content. The overall results of the textual classification in terms of precision are 0.74, while in terms of recall they are 0.35, achieving an F-measure of 0.48. Figures 8.22 and 8.23 show the results by category and language. In general, the rules achieved satisfactory results in terms of precision, especially in the awareness, evaluation, and purchase stages for English, and awareness for Spanish. Results in terms of recall were lower than those achieved in precision, as the rules were designed to be very specific in order to minimise the number of false positives. Generally, the stage where we obtained the best results is awareness, specifically for Spanish.

Figure 8.22: Accuracy of the Consumer Decision Journey classifier for English

We also offer the results for the classification along the different business sectors (Figure 8.24) in order to evaluate the difficulties of the classification depending on the domain. We found that banking and beverages were the business sectors where we obtained the best results, with the greatest values of F-measure.

The distinction among the different stages of the Consumer Decision Journey is not always clear, due to the ambiguity of short texts. Frequently, belonging to one stage or another is strongly related to the type of product, and the differentiation among stages can only be performed by applying extra-linguistic knowledge. Sentences such as "I like this beer" and "I like this car" were frequently found in the corpus. In the first case, it is very likely that the user has already tried the product (postpurchase experience), since it would be strange for a customer to state that he likes a drink (or some food) without actually tasting it. In the second case, instead, the actual consumption of the product is less probable, and the customer can like the car just because of its television advertisement or its design, for example. These kinds of ambiguities are especially frequent between
evaluation and postpurchase experience, and the linguistic patterns are not able to capture the differences between them since they are expressed through the same linguistic expression. A further classification of products depending on domain-dependent features could be useful in order to discriminate between evaluation and postpurchase experience in these types of ambiguous cases.

Figure 8.23: Accuracy of the Consumer Decision Journey classifier for Spanish

Finally, there are multiple geographic varieties of English and Spanish that present lexical differences. This implies additional difficulties for pattern identification, since lexical units differ from one variety to another and are especially hard to detect. Further work in this line (i.e. improving the normalisation process by transforming lexical units to a canonical form) could help to improve the recall results.

Figure 8.24: Accuracy of the Consumer Decision Journey classifier by sector

8.7.2.2 Technique for Detecting Marketing Mix Attributes

We have also evaluated how the decision tree classifiers perform in the classification of each short text depending on the Marketing Mix element (or elements) expressed. We have used the 10-fold cross-validation approach for evaluating the developed classifiers. We have obtained an overall precision of 0.75 and an overall recall of 0.37, with an F-measure of 0.5. The results obtained in this task for English and Spanish can be seen respectively in Figures 8.25 and 8.26. As observed in the figures, the results are generally low (except for Advertisement) in terms of recall, which ranges from 0.04 to 0.80 for Spanish and from 0.09 to 0.83 for English.

It seems that there is a logical relation between the number of texts of the positive class utilised to train the model and the corresponding results in terms of recall and precision. For example, in Spanish the classifier that was trained with the smallest number of texts was the one for the positive class of Customer Service, for which we only had 85 short texts. The results of the classification are 0.04 and 0.38 for recall and precision, respectively. The results for English are along the same lines; one of the Marketing Mix elements trained with the fewest texts of the positive class (238) is Point of Sale, and therefore the results obtained are also the lowest ones: a recall of 0.09 and a precision of 0.48.

Figure 8.25: Accuracy of the Marketing Mix classifier for English

We can observe the same situation in the models trained with a larger number of texts; both in Spanish and English, the Advertisement classifier was trained with a lot of positive examples, and thus this class achieved very good results in terms of recall as well as precision (0.80 and 0.83 for recall and 0.88 and 0.93 for precision, for Spanish and English, respectively). It is also interesting to see how some Marketing Mix elements are much more difficult to identify than others. For example, we can observe that the element Quality is very hard to classify, even when increasing the number of texts used to train the model. In Spanish the number of texts used as positive examples is 371, and we obtained 0.18 and 0.56 for recall and precision, respectively.
However, in English, where the model was trained with a larger number of texts as positive examples (1,046 texts), the results are in line with those obtained for Spanish: 0.13 for recall and 0.61 for precision.

Figure 8.26: Accuracy of the Marketing Mix classifier for Spanish

These differences in difficulty among elements are due to the dispersion of the vocabulary used to talk about some Marketing Mix elements. For example, we observed that customers could talk about Quality by making reference to comfort (in the Automotive industry, for example), to security (in Banking, for instance), or to taste (for Food or Beverages). Therefore, the reference to Quality can be made through a great variety of topics that are domain dependent and, thus, the reference to this element is much more varied than the reference to other Marketing Mix elements such as Price or Advertisement. The linguistic cues are more dispersed and thus the classifier has more difficulty relating a word to a specific class. Although the results, especially in terms of recall, should be improved, we consider that, as a first attempt to automatically classify and filter user-generated content from social media in terms of Marketing Mix elements, the results obtained are very encouraging and very satisfactory for elements such as Advertisement.

Finally, as happened with the technique for identifying Consumer Decision Journey stages, the language varieties affect the precision results. For example, the term "commercial" in American Spanish means "advertising spot", while in European Spanish it means "sales person". While the former meaning should be associated with the Advertising category, the latter meaning should be associated with the Point of Sale category.

8.7.2.3 Technique for Detecting Emotions

We have evaluated our system against a set of randomly chosen texts that covers 58% of the corpus, as described in Section 8.4.2.2. The overlap coefficient between the training set and the evaluation set was 0.14, which is quite small, so we can trust the results of the evaluation to be reliable. Figure 8.27 shows the precision and recall obtained for each emotion of our conceptual framework. We can see the number of texts classified under each emotion both by our system and by the human annotator. The overall recall is 49.73% and the precision is 71.78%. If we use the F-measure as an indicator of the best results, these correspond to satisfaction and dissatisfaction. This fact is not surprising, since the majority of the texts expressing sentiment in the corpus and, therefore, in the training corpus belong to one of these two categories. Figure 8.28 shows the precision and recall obtained for each domain, and Figure 8.29 shows the precision and recall obtained for each type of media.

We have also compared our results to the ones provided by an existing commercial tool for detecting the polarity of opinions, owned by Havas Media Group. Such a system is also rule-based and its rules follow a similar approach, although the antecedent only supports components made of lemmas and part-of-speech tags, and the consequent only considers one category that captures the negative and the positive opinions, instead of the four ones (reflecting eight sentiments) of our work.
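As an illustration of how this kind of rule-based classification operates, the following minimal Python sketch matches a rule antecedent made of lemma and part-of-speech components against a lemmatised, tagged text; the rule, the tag labels, and the category name are hypothetical simplifications rather than the actual grammar of either system.

# Illustrative sketch of lemma/part-of-speech rule matching. The rule, tags, and
# category label are hypothetical simplifications, not the grammar of either system.

# Antecedent: sequence of (lemma, part-of-speech) components; consequent: a category.
RULE = {"antecedent": [("sentir", "V"), ("odio", "NC")], "consequent": "hate"}

def matches(antecedent, tagged_text):
    # True if the antecedent occurs as a contiguous subsequence of the tagged text.
    n, m = len(tagged_text), len(antecedent)
    return any(tagged_text[i:i + m] == antecedent for i in range(n - m + 1))

# Hypothetical lemmatised and tagged output for "Siento odio por esa marca".
tagged = [("sentir", "V"), ("odio", "NC"), ("por", "SP"), ("ese", "DD"), ("marca", "NC")]

if matches(RULE["antecedent"], tagged):
    print(RULE["consequent"])  # -> hate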
An important difference between the two experiments is the size of the corpora used for evaluation: the corpus used for evaluating the polarity classifier contained 3,705 texts, while ours contained 15,428 texts. The polarity system has a recall of 20.82% (lower than ours, 49.73%) and a precision of 84.85% (higher than ours, 71.78%). However, when we reduce our four categories to one (putting together negative polarities on one side and positive ones on the other), the results show a recall of 58.48% and a precision of 84.42%. Thus, under such circumstances, we can affirm that we achieve a similar precision to, and a better recall than, the previous system, while being based on a more fine-grained classification.

Figure 8.27: Accuracy of the emotions classifier

Figure 8.28: Accuracy of the emotions classifier by sector

Figure 8.29: Accuracy of the emotions classifier by social media type

Approach                                      Accuracy
Based on metadata about locations of users    0.81
Based on friendship networks                  0.86
Based on descriptions about users             0.81
Content-based                                 0.81
Hybrid                                        0.81

Table 8.9: Accuracy of the place of residence identification approaches

8.7.2.4 Technique for Detecting Place of Residence

We have evaluated the five different approaches implemented by this technique against the evaluation data set described in Section 8.5.3.2. The evaluation results are shown in Table 8.9. All the approaches achieve the same accuracy (0.81), with the exception of the one based on friendship networks, which improves the accuracy to 0.86, outperforming the approaches described in the State of the Art, which achieve accuracies from 0.51 to 0.71.

Regarding the approaches that perform named entity recognition for detecting the locations included in the description of user profiles, or in the content published and shared by those users, we have evaluated this step by using the training set published by the Concept Extraction Challenge of the #MSM2013 Workshop [Basave et al., 2013]. Such a training set consists of a corpus of 2,815 micro-posts written in English. The precision obtained is 0.52, while the recall is 0.43 (F1 = 0.47).

8.7.2.5 Technique for Detecting Gender

We have evaluated the coverage (i.e. the proportion of users classified) of the two gender recognition approaches described by this technique against the whole evaluation data set described in Section 8.6.3.2. The approach based on profile metadata has been able to classify 46,030 users (9,284 female users and 36,746 male users), achieving a coverage of 66% of the corpus. In turn, the approach based on mentions to users has classified 46,396 users (9,386 female users and 37,010 male users), improving the coverage up to 67%. Table 8.10 compares the coverage of both approaches.
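As an illustration, the profile-metadata approach whose coverage is reported above essentially reduces to a dictionary lookup over the tokens of the profile name field; the following minimal Python sketch uses a hypothetical two-entry dictionary instead of the census-based dictionaries built during the Data Preparation Activity (Section 8.6.2).

# Minimal sketch of the dictionary lookup over profile name metadata. The dictionary
# entries are hypothetical; the real dictionaries are built from census data (Section 8.6.2).

GENDER_DICTIONARY = {
    "maria": "female",   # hypothetical entries
    "oscar": "male",
}

def gender_from_profile_name(name_metadata):
    # Return the gender of the first token found in the dictionary, or None.
    for token in name_metadata.lower().split():
        if token in GENDER_DICTIONARY:
            return GENDER_DICTIONARY[token]
    return None  # the user is counted as "not identified"

print(gender_from_profile_name("Oscar Garcia"))  # -> male
print(gender_from_profile_name("@coolbrand"))    # -> None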
In addition, we have checked the automatic classification against the gold standard, obtaining an overall accuracy of 0.9 for the approach based on user names, and of 0.84 for the approach based on mentions to users.

                   User Names      Mentions to Users   Coverage Gain
Female             9,284 (13%)     9,386 (14%)         +102
Male               36,746 (53%)    37,010 (53%)        +264
Not Identified     23,231 (34%)    22,864 (33%)        (Total Gain = +1%)

Table 8.10: Coverage of the gender recognition approaches

By gender, for the approach based on user names, the precision obtained is 0.98 for male users and 0.97 for female users, while the recall is 0.8 and 0.87, respectively. For the approach based on mentions to users, the precision obtained is 0.8 for male users and 0.79 for female users, while the recall is 0.85 and 0.95, respectively. Therefore, the approach based on mentions to users achieves a lower precision, but increases the recall with respect to the approach that only makes use of user names. Figure 8.30 compares the performance of the two approaches.

Figure 8.30: Performance of the gender recognition approaches

As explained in Section 8.6.2, we perform automatic language identification during the Clean Data task for filtering out users that do not write in Spanish. The false positives introduced by the language identification component, whose accuracy is 0.9302, may cause the inclusion in the evaluation corpus of authors that might not be Spanish speakers, penalising the recall of the method.

Table 8.11 shows the confusion matrix for the approach based on mentions to users.

                                 Predicted class
Actual class       Male      Female    No gender
Male               530       42        49
Female             10        528       20
No gender          130       97        103

Table 8.11: Confusion matrix with the results of the approach based on mentions to users

Users manually annotated as "no gender" correspond to non-personal Twitter accounts (e.g. a brand or a corporation), while those automatically classified as "no gender" are the users for which the algorithm was not able to identify a gender. Mainly, the confusions are produced between the male and female classes and the residual class. As the table reflects, there is not a significant number of confusions between male and female users (i.e. male users classified as female and vice versa). Most of the errors correspond to male or female users that could not be classified by the gender recognition technique.

It is difficult to make a direct comparison of our technique with the previous works described in the State of the Art (Section 2.6.6), since our classifier has been designed for the Spanish language and the other ones have been trained and evaluated with corpora of English speakers. If we ignore this fact, the technique developed by Mislove et al. [2011] identifies a gender for 64.2% of the users, while ours achieves a coverage of 66.45%. Additionally, we have achieved a lower accuracy than Burger et al. [2011], who achieved 0.92. However, the technique proposed by Burger et al. [2011] requires more than 100,000 users in the training data set (together with the tweets authored by them), while our technique does not require training a classifier, as it relies on linguistic knowledge, avoiding the cost of corpus annotation by humans. Regarding the distributions by gender, Mislove et al. [2011] identified 71.8% of male users among the U.S. population that uses Twitter.
In our case, we identified 79.8% of male users, confirming that Spanish speakers on Twitter were also predominantly male within the period of the experiment (May 2012 - March 2013).

8.8 Validation of Hypotheses

The evaluation performed on our approach for identifying Consumer Decision Journey stages in user-generated content validates Hypothesis 3, since our technique is able to classify texts along the different phases with an acceptable accuracy, with precision results similar to those of the works on identification of wishes. Consequently, our technique is able to approximate distributions of consumers (i.e. the authors of the texts) over the stages of the Consumer Decision Journey process.

The evaluation performed on our approach for detecting Marketing Mix attributes in user-generated content validates Hypothesis 4, since our technique is able to classify texts according to the Marketing Mix framework with an acceptable accuracy, and consequently is able to approximate distributions of consumers (i.e. the authors of the texts) that refer to the distinct Marketing Mix elements.

The evaluation performed on our approach for detecting emotions in user-generated content validates Hypothesis 4, since our technique is able to identify expressions of satisfaction, dissatisfaction, trust, fear, love, hate, happiness, and sadness within user-generated content with an acceptable accuracy, and consequently is able to approximate distributions of consumers (i.e. the authors of the texts) that express the different kinds of sentiments.

Regarding place of residence detection, the evaluation performed validates Hypothesis 6, since the most accurate approach is the one based on friendship networks. Therefore, the homophily that characterises social networks can be exploited for determining the place of residence of social media users. The results obtained show that the social network is a valuable source of information for obtaining the socio-demographic attributes of single users.

Finally, the evaluation results regarding gender detection show that the approach that exploits the gender concord present in the contents that explicitly mention social media users, applied when a gender cannot be retrieved from the user's metadata, improves the coverage of the gender identification technique. This validates Hypothesis 7.

Chapter 9
CONCLUSIONS AND FUTURE WORK

Social media has been at the centre of attention of advertising agencies as it has come to form part of the media addressed by marketing activities. Advertising agencies have been exploring possible ways to use this new medium as a mechanism for producing word-of-mouth. Therefore, social media is being considered as a platform in a viral marketing strategy. One of the expected benefits of this thesis is to provide marketers and business experts with tools for understanding the principal functions of social media from a marketing point of view. That is, to disentangle the effect social media has on consumer behaviour during the various stages of the decision-making process.

As the main conclusion, the techniques described in this thesis can be implemented within applications that aim at observing consumers in social media, extracting socio-demographic and psychographic information from them. We have defined an ontology network that structures the information published in social media that is useful for marketing analysis purposes, and we have characterised such media by analysing the morphosyntactic characteristics of the content published on them.
Additionally, we have provided a technique that uniquely identifies social media users using the fingerprint of their devices, regardless of the changes that frequently occur in these fingerprints. We have also provided a collection of techniques for obtaining psychographic segmentations of consumers in terms of their position in the purchase funnel, the marketing attributes of the brands they refer to, and their sentiment about these brands. Finally, we have described a set of techniques for identifying two socio-demographic attributes of social media users, i.e. their place of residence and their gender.

Next, we detail the conclusions for each of the contributions of this thesis to the State of the Art.

9.1 Social Media Data Model for Consumer Analytics

We have developed an ontology that models information that can be extracted from social media about consumers. Such information can be directly retrieved from social media data or inferred from users' activity and opinions. By combining and structuring the directly and indirectly retrieved data, we are able to store enriched consumer-related information in a graph-based database to be analysed in different manners by marketing professionals. As an example, through a CRM connection (e.g. implemented by a plugin of a CRM system), this information could be provided to standard business applications and be accessible for daily business decisions.

9.2 Morphosyntactic Characterisation of Social Media Contents

Natural language processing (NLP) techniques are a key piece for analysing the content published in social media. Social media content presents the characteristics of non-editorially controlled media, as opposed to the content published in traditional media. In this context, social media communication has moved from daily publications to real-time interactions. Thus, when applying NLP techniques to the user-generated content published in social media, we find issues of text quality that hinder the application of such techniques.
Such technique consists on capturing the fingerprint of the machine that the user uses for navigating the Web. One drawback of this technique is that such fingerprint changes over time, so that the registration of fingerprints must be accompanied by a mechanism for detecting its temporal evolution. In this thesis, we have described an algorithm that allows clustering fingerprints that correspond to the same user, regardless of fingerprint evolution. The evaluation results demonstrate the effectiveness of the algorithm, and improve previous results. The algorithm proposed can be used instead of the technique based on cookies, or as a complement to this technique for regenerating cookies when such cookies are removed. If the algorithm is used as an alternative to the technique based on 259 cookies, every time an activity record is registered, the fingerprint obtained must be compared to each cluster of browser fingerprints generated before, because of the algorithm linear complexity. In contrast, if the algorithm is used as a mechanism for regenerating cookies, the fingerprint must be compared with existing clusters only when the cookie is deleted, reducing significantly computational resources needed for identifying users and augmenting, even more, the accuracy of unique user identification. Moreover, this variant could be supplemented with the use of Internet Explorer data persistence and web storage capabilities. Our algorithm improves the accuracy of unique browser identification over previous approaches, letting effectively counting unique visitors, thus measuring the impact of digital advertisement campaigns better in environments where existing techniques fail (e.g. mobile devices or smart TVs which do not support cookies). Therefore, the algorithm measures the audience of on-line campaigns effectively regardless the device and security restrictions, which enhances decision support. Previous approaches were temporally constrained because of cookie deletion or fingerprint attribute changes. Thus reporting periods were affected by such temporal constraints. Our approach enables tracking user activity during more time since it allows recovering from fingerprint changes (or cookie deletions when combined with the user identification technique based on cookies). Thus, website or advertisement campaign monitoring periods can be larger without losing accuracy. In addition, advertisers will be benefited with more precise audience measures, avoiding counting the same browser more than once. This will impact positively on media planning optimisation allowing better budget distribution over different online media, and enhancing performance metrics and user profiling. The algorithm can be executed as a batch process or in real-time as new fingerprints arrive to the system. A real-time version of the algorithm should require optimisations to reduce the number of comparisons between fingerprint and cluster signatures, reducing processing time. A disadvantage of the technique described in this document is the amount of additional JavaScript code to be added to web pages in order to get some fingerprint attributes. Such scripting code could prevent certain advertising media of adopting the technique. Nevertheless, importing external JavaScript definitions 260 reduces the code to be inserted in web pages to one line. Finally, with respect to the ethical aspects of user tracking, Sison and J. [2005] discuss issues relating to privacy on the on-line advertising domain. 
It is important to remark, that the aim of this research is not invading user privacy, but uniquely accounting the users that visit a given website. Thus, we are not interested in personal data about users, but in accurate Web Analytics measures at the aggregated level. Moreover, browser fingerprinting does not suppose a threat for user privacy when appropriate anonymization techniques are applied, for instance, transforming data applying cryptographic functions, such as SHA-1 [Eastlake and Jones, 2001], to fingerprint attribute values. Anyway, technologies implementing our technique, and other similar ones, should follow policies such as “Do Not Track” [Mayer et al., 2011], which enables users to opt out tracking by websites they do not visit, including analytics services and advertising networks. 9.4 Techniques for Segmentation of Consumers from Social Media Content This section presents the conclusions regarding the techniques provided by this thesis for segmenting consumers according to the contents they publish and share in social media. Future lines of work include experimenting with the detection of more demographic and psychographic user characteristics which are relevant to the marketing and communication domains, including: age, political orientation and interests, among others. 9.4.1 Technique for Detecting Consumer Decision Journey Stages We have presented a novel technique for analysing user-generated texts in terms of their belonging to one of the four stages of the Consumer Decision Journey. Using a corpus made up of texts extracted from different social media sources and pertaining to several business sectors, we manually identified specific linguistic 261 patterns and used them in a rule-based classifier to unambiguously distinguish among texts related to the different stages. We achieved an overall precision of 0.78 and 0.65, and an overall recall of 0.34 and 0.39, for English and Spanish, respectively. To our knowledge, this is the first attempt to automatically obtain Consumer Decision Journey business indicators from user-generated content using rule-based classifiers. The automatic identification of these business indicators is very much needed in order to drastically reduce time and efforts in their manual activities by marketing analysts. Due to the novelty of this research area, much work remains to be done, including its adaptation to other languages and the research on possible methods to improve the overall recall. Lastly, we also plan to include more business sectors in order to make the system more robust. 9.4.2 Technique for Detecting Marketing Mix Attributes We have developed machine-learning classifiers that enable us to identify Marketing Mix elements in user-generated texts. This allows a more accurate, finegrained consumer buzz analysis (i.e. not only establishes purchase stages but identifies relevant, common topics of conversation among customers throughout their shopping experiences) and, in consequence, enables marketers to take betterinformed business decisions. The system has been implemented training a set of Decision Tree classifiers achieving an overall precision of 0.76 and 0.75, and an overall recall of 0.44 and 0.31, for English and Spanish, respectively. As happened with the Consumer Decision Journey classifier, to our knowledge, this is the first attempt to automatically obtain Marketing Mix business indicators from user-generated content using machine-learning classifiers, reducing earned media analysis efforts to marketing analysts. 
Also, due to the novelty of this research, much work remains to be done, like adapting the technique to other languages, improving the recall, or learning texts from new business sectors. 9.4.3 Technique for Detecting Emotions In this thesis, we have developed a rule-based technique that classifies Spanish texts from different social media channels according to four polarised cat- 262 egories (satisfaction-dissatisfaction, trust-fear, love-hate and happiness-sadness) that capture the main sentiments expressed through these channels. The results of the evaluation of the technique (49.73% recall and, 71.78% precision) are quite satisfactory, considering the fine-grain classification. Nevertheless, refining and expanding the set or rules (consisting of more than 1200 rules at this moment) can improve the results. We have found a set of future lines of work, which are described next. Rules that are too specific match few texts, thus making it necessary to have a huge set of rules in order to cover all the domains. However, this specificity leads to a higher accuracy, i.e. when an antecedent matches (part of) a text, the system would very likely classify it correctly. In addition we have devised several ways to expand the set of rules by adding rules based on the existing ones: • Replacing words or lemmas with others that do not appear in the analysed corpus. The ideal substitutes are the synonyms of the ones actually examined in the same context. For verbs, good replacement candidates are those that are collocates of the same sentiment. Badele3000 can provide us with this information. As we have seen, it can help us to retrieve domainindependent collocations of common sentiments, along with the semantic relation between the terms of those collocations. For example, since both sentir (to feel) and tener (to have) are values of the LF Func1 for odio (hatred) (see Table 8.8), rule (1) could be added, as it is equivalent to the following rule (2). tener#V odio#NC [SP] ENTITY → LH - 1 (1) sentir#V odio#NC [SP] ENTITY → LH - 1 (2) • Elaborating less restrictive rules, i.e. omitting some of the elements in the antecedent. This generalisation would likely lead to a larger coverage. Nevertheless, there is no guarantee that the resulting rules would not decrease the accuracy of the system. A new evaluation should be carried out for each new rule in order to know its impact. Accordingly, a trade-off between coverage and accuracy is sometimes necessary. For example, since 263 Meaning in Spanish siento fuerte odio a/por marca siento odio fuerte a/por marca Meaning in English I feel strong/forceful hatred against/for brand I feel strong/forceful hatred against/for brand Rules [D] fuerte#A odio#NC [SP] ENTITY → LH - 1 sentir#V odio#NC fuerte#A [SP] ENTITY → LH - 1 Table 9.1: Rule reordering example texts without occurrences of the entity have been discarded, a shallower approach, where the entity is not part of the antecedent, could be considered. Thus, we could derive rule (3) from rule (4) by omitting ENTITY . Another example could be removing a lemma and taking into account only its part-of-speech tag. For instance, rule (4) comes from rule (3) by replacing adverb muy (very) with any non-negative adverb ([RG] ). However, this rule is not correct, since poco (little) is an adverb that diminishes the adjective degree while muy intensifies it. muy#RG odioso#A → LH – 2 (3) ENTITY ser#V [RG] odioso#A → LH – 2 (4) We could also benefit from resources with domain knowledge (e.g. 
an ontology of the products of a field). In that case, we could write less specific antecedents in our rules and use that knowledge instead. • Re-ordering the components of the antecedent. In Spanish, this can be done not only by shifting passive and active voice, but also by using a hyperbaton (i.e. Spanish has a very free syntax, where several syntactic combinations of words can be correct sentences). For instance, many times the positions of nouns and their adjectives can be interchangeable. Thus, both fuerte odio and odio fuerte are correct (see Table 9.1). As we have explained, some rules are created by using domain-independent resources and procedures. Thus, besides evaluating our system with this new set of rules, we also plan to apply it to new domains in order to analyse its generality. Finally, the grammar allows for quite a flexible specification at the morphosyntactic level, but sometimes information at the syntactic dependency level can 264 be useful too. For instance, knowing the scope of a negation could help to determine the units to be computed by the classifier. 9.4.4 Technique for Identifying the Place of Residence of Social Media Users The evaluation results obtained for the technique for identifying the place of residence of social media users show that the approaches that make use of the user’s community achieve better performance than the ones based on the analysis of the content published and shared by the user. While the major part of the community of a user shares the place of residence (because of the homophily principle in social networks), the mentions to locations included in the content published by the users are not related necessarily with their place of residence. 9.4.5 Technique for Identifying the gender of Social Media Users We have achieved very satisfactory results for gender identification by just making use of user profile metadata, since the precision obtained is high and the technique used is very simple with respect to computational complexity, which leads to a straightforward set up in a production environment. The approach based on mentions to users increases the recall in the cases where the technique based in metadata about users is not able to identify the gender, because for the Spanish language there exists grammatical agreement with respect to gender between nouns and other part-of-speech categories (e.g. adjectives and pronouns). This technique can be extended in the future with the use of facial analysis techniques, like the one proposed by Bekios-Calfa et al. [2014], as many users publish their photograph in their social media profiles. 9.4.6 Normalisation of User-Generated Content The text classifiers described in this thesis make use of an approach for usergenerated content normalisation that relies on existing web resources collectively 265 developed, finding that such resources, useful for many NLP tasks, are also valid for the task of micropost normalisation. With respect to the future lines of work, we plan to adapt the normaliser to new languages by the incorporation of the corresponding dictionaries and improving the existing lexicons by the use of more available resources, such as the anchor texts from intra wiki links. Finally, we plan to improve the normalisation of typos consisting in multiword expressions, as different words should be transformed into just one (e.g. the Spanish expression “a cerca de” should be transformed into “acerca de”), as well as cases where joined words should be split (e.g. 
“realmadrid” should be transformed into “real madrid”) by using existing word breaking techniques, such as the one described by Wang et al. [2011]. 9.4.7 Evaluation of Scalability Because of its scale, brands’ earned media mentions extracted from social media channels and gathered by marketing and communications agencies can be considered “Big Data”, as they are characterised by its huge volume of data, high velocity of production, and high heterogeneity [O’Leary, 2013]. Media agencies like GroupM103 or Havas Media Group extract more that 1,200 million posts a year from its social media monitoring tools, including mentions to its monitored brands and their competitors. This represents a volume of more than 1.5 TB of raw data mainly consisting of text, associated content and authors’ metadata. Such volume grows very significantly when is processed, augmented with different classifications, and integrated and indexed within databases. The high velocity in which data is produced is a challenge, as data needs to be processed faster than content is produced, at a near real-time pace, even if the content is batch-processed. In addition, variety along several dimensions (e.g. content quality, multilinguality, multiplicity of formats, diversity of technologies and techniques to be integrated) has conditioned the infrastructure developed to evaluate the scalability of the work presented in this paper. 103 http://www.groupm.com 266 We have performed a preliminary test of the scalability of the software components by integrating them within a Big Data processing platform. However, a more rigorous validation of the scalability of the techniques presented in this work in a Big Data scenario is still pending. Specifically, we have integrated the techniques for consumer segmentation presented in this thesis into a Big Data infrastructure. Such infrastructure is based in Hadoop-related104 technologies, namely, Flume105 for real-time consumption of posts, Hive and MapReduce for batch processing and data aggregation, HDFS for temporal data storage, and HBase for storing the linguistic resources queried by our classifiers. Once the data are processed, they are indexed in a Solr106 cloud environment, and aggregation results are uploaded to relational databases with OLAP capabilities. Processes have been developed using the Scala107 programming language. Measures of the time required for the multi-classification of each piece of text show that it takes an average of 0.46 seconds per post (note that length of test varies across different sources). Therefore, we found it very useful in order to automatically tag the data stream continuously extracted and analysed by marketing companies. 104 http://hadoop.apache.org http://flume.apache.org 106 http://lucene.apache.org/solr 107 http://www.scala-lang.org 105 267 268 REFERENCES Aguado de Cea, G., Barrios, M., Bernardos, S., Campanella, I., Montiel-Ponsoda, E., Mu˜ noz-Garc´ıa, O., and Rodr´ıguez, V. (2014). An´alisis de sentimientos en un corpus de redes sociales. In Proceedings of the 31st International Conference of the Spanish Association of Applied Linguistics, AESLA’14, pages 18–20, San Crist´obal de la Laguna, Tenerife, Spain. Aguado de Cea, G. and Bernardos, S. (2007). Cal´ıope: herramienta para gestionar un corpus y un glosario de t´erminos inform´aticos. In Proceedings of the 6th Annual Conference of the European Association of Languages for Specific Purposes, AELFE’07, pages 292–299, Lisbon, Portugal. 
Alegria, I., Aranberri, N., Fresno, V., Gamallo, P., Padr´o, L., San Vicente, I., Turmo, J., and Zubiaga, A. (2013). Introducci´on a la tarea compartida tweetnorm 2013: Normalizaci´on l´exica de tuits en espa˜ nol. In Alegria, I., Aranberri, N., Fresno, V., Gamallo, P., Padr´o, L., San Vicente, I., Turmo, J., and Zubiaga, A., editors, Proceedings of the tweet normalisation workshop co-located with 29th conference of the Spanish Society for Natural Language Processing, SEPLN’13, pages 1–9, Madrid, Spain. Alvestrand, H. T. (1995). RFC 1766 – Tags for the identification of languages. https://www.ietf.org/rfc/rfc1766.txt. Arnold, M. (1960). Emotion and personality: psychological aspects. Emotion and Personality. Columbia University Press. Asur, S. and Huberman, B. A. (2010). Predicting the future with social media. In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web 269 Intelligence and Intelligent Agent Technology - Volume 1, WI-IAT’10, pages 492–499, Washington DC, USA. IEEE Computer Society. Atserias, J., Comelles, E., and Mayor, A. (2005). TXALA: un analizador libre de dependencias para el castellano. Procesamiento del Lenguaje Natural, 35:455– 456. Backstrom, L., Kleinberg, J., Kumar, R., and Novak, J. (2008). Spatial variation in search engine queries. In Proceedings of the 17th international World Wide Web Conference, WWW’08, pages 357–366, Beijing, China. ACM. Backus, J. W., Bauer, F. L., Green, J., Katz, C., McCarthy, J., Perlis, A. J., Rutishauser, H., Samelson, K., Vauquois, B., Wegstein, J. H., van Wijngaarden, A., and Woodger, M. (1963). Revised report on the algorithm language ALGOL 60. Communications of the ACM, 6(1):1–17. Basave, A. E. C., Varga, A., Rowe, M., Stankovic, M., and Dadzie, A.-S. (2013). Making sense of microposts (#msm2013) concept extraction challenge. In Proceedings of the Concept Extraction Challenge at the Workshop on ’Making Sense of Microposts’ co-located with the 22nd International World Wide Web Conference, WWW’13, pages 1–15, Rio de Janeiro, Brazil. Bekios-Calfa, J., Buenaposada, J. M., and Baumela, L. (2014). Robust gender recognition by exploiting facial attributes dependencies. Pattern Recognition Letters, 36:228–234. Bernardos, S. and Barrios, M. (2008). Data model for a lexical resource based on lexical functions. Research in Computing Science, 27:9–22. Berners-Lee, T. (1994). RFC 1738 – Uniform Resource Locators (URL). https: //www.ietf.org/rfc/rfc1738.txt. Berners-Lee, T., Fielding, R. T., and Masinter, L. (2005). RFC 3986 - Uniform Resource Identifier (URI): generic syntax. https://www.ietf.org/rfc/ rfc3986.txt. 270 Boda, K., F¨oldes, A., Guly´as, G., and Imre, S. (2012). User tracking on the web via cross-browser fingerprinting. Information Security Technology for Applications, 7161:31–46. Borden, N. H. (1964). The concept of the marketing mix. Journal of Advertising Research, 4(2):2–7. Box, G. E. P. and Jenkins, G. (1990). Time series analysis, forecasting and control. Holden-Day, Incorporated. Breslin, J. G., Decker, S., Harth, A., and Bojars, U. (2006). SIOC: an approach to connect Web-based communities. International Journal of Web Based Communities, 2(2):133–142. Brooke, J., Tofiloski, M., and Taboada, M. (2009). Cross-linguistic sentiment analysis: from English to Spanish. In Proceedings of the 7th International Conference on Recent Advances in NLP, RANLP’09, Borovets, Bulgaria. Buitelaar, P., Arcan, M., Iglesias, C. A., S´anchez-Rada, J. F., and Strapparava, C. (2013). Linguistic linked data for sentiment analysis. 
In Chiarcos, C., Cimiano, P., Declerck, T., and McCrae, J. P., editors, Proceedings of the 2nd Workshop on Linked Data in Linguistics: Representing and Linking Lexicons, Terminologies and Other Language Data. Collocated with the Conference on Generative Approaches to the Lexicon, LDL’13, pages 1–8, Pisa, Italy. Association for Computational Linguistics. Burby, J. and Brown, A. (2007). Web Analytics definitions. http: //www.digitalanalyticsassociation.org/Files/PDF_standards/ WebAnalyticsDefinitionsVol1.pdf. Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP’11, pages 1301–1309, Edinburgh, United Kingdom. Association for Computational Linguistics. Cambria, E., Schuller, B., Xia, Y., and Havasi, C. (2013). New avenues in opinion mining and sentiment analysis. Intelligent Systems, IEEE, 28(2):15–21. 271 Cambria, E. and White, B. (2014). Jumping NLP curves: a review of natural language processing research. Computational Intelligence Magazine, IEEE, 9(2):48–57. Carroll, J. J., Bizer, C., Hayes, P., and Stickler, P. (2005). Named graphs, provenance and trust. In Proceedings of the 14th International Conference on World Wide Web, WWW’05, pages 613–622, Chiba, Japan. ACM. Cavnar, W. B. and Trenkle, J. M. (1994). N-gram-based text categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, SDAIR’94, pages 161–175, Las Vegas, USA. Chan, W. S. (2003). Stock price reaction to news and no-news: Drift and reversal after headlines. Journal of Financial Economics, 70:223–260. Chang, H., Lee, D., Eltaher, M., and Lee, J. (2012). @phillies tweeting from philly? predicting twitter user locations with spatial word usage. In Proceedings of the 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM’12, pages 111–118, Istambul, Turkey. Chaumartin, F.-R. (2007). Upar7: A knowledge-based system for headline sentiment tagging. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval’07, pages 422–425, Prague, Czech Republic. Association for Computational Linguistics. Cheng, Z., Caverlee, J., and Lee, K. (2010). You are where you tweet: a content-based approach to geo-locating twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM’10, pages 759–768, Toronto, Canada. ACM. Chetviorkin, I. I., Braslavski, P. I., and Loukachevitch, N. V. (2011). Rule based approach to sentiment analysis. In Proceedings of the Sentiment Analysis Track at the Russian Information Retrieval Evaluation Seminar, ROMIP’11. Clore, G. L., Ortony, A., and Foss, M. A. (1987). The psychological foundations of the affective lexicon. Journal of Personality and Social Psychology, 53(4):751– 755. 272 Codina, J. and Atserias, J. (2012). What is the text of a tweet? In Proceedings of @NLP can u tag #user generated content?! via lrec-conf.org, LREC’12, pages 29–33, Istanbul, Turkey. ELRA. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46. Corcoran, S. (2009). Defining earned, owned and paid http://blogs.forrester.com/interactive_marketing/2009/12/ defining-earned-owned-and-paid-media.html. media. Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297. Coursey, K., Mihalcea, R., and Moen, W. (2009). 
Using encyclopedic knowledge for automatic topic identification. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL’09, pages 210–218, Boulder, Colorado, USA. Association for Computational Linguistics. Court, D., Elzinga, D., Mulder, S., and Vetvik, O. J. (2009). The consumer decision journey. McKinsey Quarterly, 3:1–11. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., Gorrell, G., Funk, A., Roberts, A., Damljanovic, D., Heitz, T., Greenwood, M. A., Saggion, H., Petrak, J., Li, Y., and Peters, W. (2011). Text Processing with GATE (Version 6). The University of Sheffield, Department of Computer Science. De Bruyn, A. and Lilien, G. (2008). A multi-stage model of word-of-mouth influence through viral marketing. International Journal of Research in Marketing, 25(3):151–163. Deane, J., Loren, P., and Terry, R. (2011). Behavioural targeting in online advertising using web surf history analysis and contextual segmentation. International Journal of Electronic Business, 9(3):271–291. Dellarocas, C. (2003). The digitization of word of mouth: Promise and challenges of online feedback mechanisms. Managegement Science, 49(10):1407–1424. 273 Ding, X. and Liu, B. (2007). The utility of linguistic rules in opinion mining. In Proceedings of the 30th Annual International ACM SIGIR Conference, SIGIR’07, pages 811–812, Amsterdam, The Netherlands. ACM. Divol, R., Edelman, D., and Sarrazin, H. (2012). Demystifying social media. McKinsey Quarterly, 12(2):66–77. Dodig-Crnkovic, G. (2002). Scientific methods in Computer Science. In Proceedings of the Conference for the Promotion of Research in IT at New Universities and at University Colleges in Sweden, Sk¨ovde, Sweden. Droms, R. (1997). RFC 2131 – Dynamic Host Configuration Protocol. https: //www.ietf.org/rfc/rfc2131.txt. Eastlake, D. and Jones, P. (2001). RFC 3174 – US Secure Hash Algorithm 1 (SHA1). https://tools.ietf.org/html/rfc3174. Eckersley, P. (2010). How unique is your Web browser? In Atallah, M. and Hopper, N., editors, Privacy Enhancing Technologies, volume 6205 of Lecture Notes in Computer Science, pages 1–18. Springer Berlin Heidelberg, Berlin, Heidelberg. ECMA (2011). Standard ECMA-262. ECMAScript language specification. http: //www.ecma-international.org/ecma-262/5.1/. Edelman, D. (2010). Branding in the Digital Age: You’re Spending Your Money in All the Wrong Places. Harvard Business Review. Egan, J. (1975). Signal detection theory and ROC-analysis. Academic Press series in cognition and perception. Academic Press. Egevang, K. (1994). RFC 1631 – The IP Network Address Translator (NAT). https://www.ietf.org/rfc/rfc1631.txt. Ekman, P. (1994). Moods, emotions, and traits. In Ekman, P. and Davidson, R., editors, The Nature of Emotion: Fundamental Questions, SAS Series, pages 56–58. Oxford University Press. 274 Ekman, P. (2005). Emotion in the Human Face. Series in Affective Science. Oxford University Press. Esuli, A. and Sebastiani, F. (2006). SENTIWORDNET: A publicly available lexical resource for opinion mining. In Proceedings of the 5th Conference on Language Resources and Evaluation, LREC’06, pages 417–422, Genoa, Italy. Fielding, R. T. (2000). Architectural Styles and the Design of Network-based Software Architectures. PhD thesis, University of California, Irvine. AAI9980887. Fielding, R. T. and Reschke, J. (2014a). RFC 7230 – Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. https://tools.ietf.org/html/ rfc7230. Fielding, R. T. and Reschke, J. (2014b). 
RFC 7231 – Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. https://tools.ietf.org/html/ rfc7231. Fleiss, J. L. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33:613–619. Franzen, G. and Goessens, C. (1999). Brands & advertising: how advertising effectiveness influences brand equity. Admap. Freed, N. and Borenstein, N. (1996). RFC 2045 - Multipurpose Internet Mail Extensions (MIME) Part One. https://www.ietf.org/rfc/rfc2045.txt. Fung, G. P. C., Yu, J. X., and Lam, W. (2003). Stock prediction: Integrating text mining approach using real-time news. In Proceedings of 2003 IEEE International Conference on Computational Intelligence for Financial Engineering, CIFER’03, pages 395–402, Hong Kong, China. Gabrilovich, E. and Markovitch, S. (2006). Overcoming the brittleness bottleneck using wikipedia: Enhancing text categorisation with encyclopaedic knowledge. In Proceedings of the 21st National Conference on Artificial Intelligence, volume 2 of AAAI’06, pages 1301–1306, Boston, Massachusetts, USA. AAAI Press. 275 Gabrilovich, E. and Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI’07, pages 1606–1611, Hyderabad, India. Morgan Kaufmann Publishers Inc. Gamallo, P., Garcia, M., and Pichel, J. R. (2013). A method to lexical normalisation of tweets. In Alegria, I., Aranberri, N., Fresno, V., Gamallo, P., Padr´o, L., San Vicente, I., Turmo, J., and Zubiaga, A., editors, Proceedings of the Tweet Normalization Workshop co-located with 29th Conference of the Spanish Society for Natural Language Processing, SEPLN’13, pages 44–48, Madrid, Spain. Gangemi, A., Presutti, V., and Reforgiato Recupero, D. (2014). Frame-based detection of opinion holders and topics: A model and a tool. Computational Intelligence Magazine, IEEE, 9(1):20–30. Garc´ıa Moya, L. (2008). Un etiquetador morfol´ogico para el espa˜ nol de Cuba. Master’s thesis, Universidad de Oriente. Facultad de Matem´atica y Computaci´on, Santiago de Cuba, Cuba. Gayo-Avello, D. (2011). Don’t turn social media into another ’literary digest’ poll. Communications of the ACM, 54(10):121–128. Gendron, M. and Feldman Barrett, L. (2009). Reconstructing the past: A century of ideas about emotion in psychology. Emotion Review, 1(4):316–339. Goldberg, A. B., Fillmore, N., Andrzejewski, D., Xu, Z., Gibson, B., and Zhu, X. (2009). May all your wishes come true: A study of wishes and how to recognize them. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL’09, pages 263–271, Boulder, Colorado. Association for Computational Linguistics. G´omez-P´erez, A., Fern´andez-L´opez, M., and Corcho, O. (2004). Ontological Engineering: with examples from the areas of Knowledge Management, e-Commerce and the Semantic Web. First Edition. Advanced Information and Knowledge Processing. Springer. 276 Graves, M., Constabaris, A., and Brickley, D. (2007). FOAF: connecting people on the Semantic Web. Cataloging & Classification Quarterly, 43:191–202. Gruhl, D., Guha, R., Kumar, R., Novak, J., and Tomkins, A. (2005). The predictive power of online chatter. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD’05, pages 78–87, Chicago, Illinois, USA. ACM. Gupta, P. 
and Harris, J. (2010). How e-WOM recommendations influence product consideration and quality of choice: a motivation to process information perspective. Journal of Business Research, 63(9–10):1041–1049. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18. Han, X., Wei, W., Miao, C., Mei, J., and Song, H. (2014). Context-aware personal information retrieval from multiple social networks. Computational Intelligence Magazine, IEEE, 9(2):18–28. Harding, W., Reed, A., and Gray, R. (2001). Cookies and web bugs: What they are and how they work together. Information Systems Management, 18:17–24. Hatzivassiloglou, V. and McKeown, K. R. (1997). Predicting the semantic orientation of adjectives. In Proceedings of the 8th Conference on European Chapter of the Association for Computational Linguistics, EACL’97, pages 174–181, Madrid, Spain. Association for Computational Linguistics. Hennig-Thurau, T., Malthouse, E. C., Friege, C., Gensler, S., Lobschat, L., Rangaswamy, A., and Skiera, B. (2010). The impact of new media on customer relationships. Journal of Service Research, 13(3):311–330. Hovi, E., Markman, V., Martell, C., and Uthus, D. (2013). Analyzing microtext. In Proceedings of the 2013 AAAI Spring Symposia, AAAI’13, page vii, Palo Alto, California, USA. Association for the Advancement of Artificial Intelligence. 277 Hu, X. and Cercone, N. (2004). A data Warehouse/OLAP framework for web usage mining and business intelligence reporting. International Journal of Computational Intelligence Systems, 19:585–606. IEEE (1990). IEEE standard flossary of software engineering terminology. IEEE Standard 610.12-1990, Standards Coordinating Committee of the Computer Society of the IEEE. IEEE (1995a). IEEE guide for software quality assurance planning. IEEE Standard 730.1-1995, Software Engineering Standards Committee of of the IEEE Computer Society. IEEE (1995b). IEEE standard for developing software life cycle processes. IEEE Standard 1074-1995, IEEE Computer Society. IEEE (1997). IEEE standard for developing software life cycle processes. IEEE Standard 1074-1997, IEEE Computer Society. ´ Jaccard, P. (1901). Etude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin del la Soci´et´e Vaudoise des Sciences Naturelles, 37:547–579. Joshi, M., Das, D., Gimpel, K., and Smith, N. A. (2010). Movie reviews and revenues: An experiment in text regression. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT’10, pages 293–296, Los Angeles, California, USA. Association for Computational Linguistics. Jurafsky, D. and Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall. Katz, P., Singleton, M., and Wicentowski, R. (2007). SWAT-MP: The SemEval2007 Systems for Task 5 and Task 14. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval’07, pages 308–313, Prague, Czech Republic. Association for Computational Linguistics. 278 Kaufmann, M. and Jugal, K. (2010). Syntactic normalization of twitter messages. In Proceedings of the International Conference on Natural Language Processing, ICON’10, pages 2–8, Kharagpur, India. Kaushik, A. (2007). Web Analytics: an hour a day. John Wiley & Sons, Incorporated. Kaushik, A. (2009). 
Kaushik, A. (2009). Web Analytics 2.0: the art of online accountability and science of customer centricity. Wiley.
Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., and Wright, S. E. (2008). ISOcat: corralling data categories in the wild. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., and Tapias, D., editors, Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC'08, pages 887–891, Marrakech, Morocco. European Language Resources Association (ELRA).
Kimball, R., Reeves, L., Thornthwaite, W., and Ross, M. (1998). The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses. John Wiley & Sons, Inc., New York, NY, USA, 1st edition.
Kimball, R. and Ross, M. (2002). The Data Warehouse Toolkit: The Complete Guide to Dimensional Modelling. John Wiley & Sons, Inc., New York, USA, 2nd edition.
Kleinginna, P. R. and Kleinginna, A. M. (1981). A categorized list of emotion definitions, with suggestions for a consensual definition. Motivation and Emotion, 5(4):345–379.
Kohavi, R. and Provost, F. (1998). Glossary of terms. Machine Learning, 30(2/3):271–274.
Kothari, C. (2004). Research Methodology: Methods and Techniques. New Age International Publishers Limited, second edition.
Kowalski, G. (1997). Information Retrieval Systems: Theory and Implementation. Kluwer Academic Publishers.
Kozareva, Z., Navarro, B., Vázquez, S., and Montoyo, A. (2007). UA-ZBSA: A headline emotion classification through Web information. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval'07, pages 334–337, Prague, Czech Republic. Association for Computational Linguistics.
Kozinets, R. V., de Valck, K., Wojnicki, A. C., and Wilner, S. J. (2010). Networked narratives: Understanding word-of-mouth marketing in online communities. Journal of Marketing, 74(2):71–89.
Larsen, B. and Aone, C. (1999). Fast and effective text mining using linear-time document clustering. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'99, pages 16–22, San Diego, California, USA.
le Cessie, S. and van Houwelingen, J. (1992). Ridge estimators in logistic regression. Applied Statistics, 41(1):191–201.
Leech, G. and Wilson, A. (1996). EAGLES. Recommendations for the morphosyntactic annotation of corpora. http://www.ilc.cnr.it/EAGLES/annotate/annotate.html.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710.
Lewis, E. (1903). Advertising department: Catch-line and argument. The Book-Keeper, 15:124–128.
Li, P., Dong, X. L., Maurino, A., and Srivastava, D. (2011). Linking temporal records. Proceedings of the VLDB Endowment, 4(11):956–967.
Liu, B. (2010). Sentiment analysis and subjectivity. In Indurkhya, N. and Damerau, F. J., editors, Handbook of Natural Language Processing, Second Edition, pages 1–38. CRC Press, Taylor and Francis Group, Boca Raton, USA.
Liu, B. (2012). Sentiment Analysis and Opinion Mining. Morgan & Claypool.
Maldonado, S. (2009). Analítica Web: medir para triunfar. ESIC Editorial, Pozuelo de Alarcón, Madrid.
Mayer, J., Narayanan, A., and Stamm, S. (2011). Do Not Track: A Universal Third-Party Web Tracking Opt Out. https://tools.ietf.org/html/draft-mayer-do-not-track-00.
McCarthy, E. J. and Brogowicz, A. A. (1981). Basic marketing: a managerial approach. Irwin Series in Marketing. R.D. Irwin.
McPherson, M., Smith-Lovin, L., and Cook, J. M. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444.
Mel'čuk, I. (1996). Lexical functions: A tool for the description of lexical relations in a lexicon. In Wanner, L., editor, Lexical functions in lexicography and natural language processing, Studies in Language Companion Series, pages 37–102. John Benjamins, Amsterdam, Philadelphia, USA.
Mihalcea, R. (2007). Using Wikipedia for automatic word sense disambiguation. In Sidner, C. L., Schultz, T., Stone, M., and Zhai, C., editors, Proceedings of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT'07, pages 196–203, Rochester, NY, USA. The Association for Computational Linguistics.
Miles, A., Matthews, B., Wilson, M., and Brickley, D. (2005). SKOS Core: simple knowledge organisation for the Web. In Proceedings of the 2005 International Conference on Dublin Core and Metadata Applications: Vocabularies in Practice, DCMI'05, pages 1:1–1:9, Madrid, Spain. Dublin Core Metadata Initiative.
Mishne, G. and Glance, N. (2006). Predicting movie sales from blogger sentiment. In Proceedings of the AAAI Symposium on Computational Approaches to Analysing Weblogs, AAAI-CAAW'06, pages 155–158.
Mislove, A., Lehmann, S., Ahn, Y.-Y., Onnela, J.-P., and Rosenquist, J. N. (2011). Understanding the demographics of Twitter users. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media, ICWSM'11, pages 554–557, Barcelona, Spain.
Mockapetris, P. (1987). RFC 1035 – Domain Names – Implementation and Specification. https://www.ietf.org/rfc/rfc1035.txt.
Mullen, T. and Collier, N. (2004). Sentiment analysis using support vector machines with diverse information sources. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP'04, pages 412–418.
Ng, S. and Hill, S. R. (2009). The impact of negative word-of-mouth in Web 2.0 on brand equity. In Proceedings of the 2009 ANZMAC Annual Conference, ANZMAC'09, Melbourne, Australia. Monash University.
Nielsen (2012a). Global trust in advertising and brand messages. http://www.nielsen.com/us/en/insights/reports/2013/global-trust-in-advertising-and-brand-messages.html.
Nielsen (2012b). State of the media – the social media report. http://www.nielsen.com/us/en/insights/reports/2012/state-of-the-media-the-social-media-report-2012.html.
Noble, S., Cooperstein, D. M., Kemp, M. B., and Munchbach, C. (2010). It's time to bury the marketing funnel – an empowered report. https://www.forrester.com/Its+Time+To+Bury+The+Marketing+Funnel/fulltext/-/E-res57495.
Nottingham, M. and Sayre, R. (2005). RFC 4287 – The Atom Syndication Format. https://tools.ietf.org/html/rfc4287.
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on World Wide Web, WWW'06, pages 83–92, Edinburgh, Scotland, UK. ACM.
O'Leary, D. (2013). Artificial intelligence and big data. Intelligent Systems, IEEE, 28(2):96–99.
Oliver, R. (1989). Processing of the satisfaction response in consumption: A suggested framework and research propositions. Journal of Consumer Satisfaction, Dissatisfaction and Complaining Behaviour, 2(1):1–16.
OMG (2011). OMG Unified Modelling Language (OMG UML), Superstructure. http://www.omg.org/spec/UML/2.4.1/Superstructure/PDF/.
Ortony, A., Clore, G., and Collins, A. (1990). The Cognitive Structure of Emotions. Cambridge University Press.
Padró, L. and Stanilovsky, E. (2012). FreeLing 3.0: towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference, LREC'12, pages 2473–2479, Istanbul, Turkey. ELRA.
Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The PageRank citation ranking: Bringing order to the Web. Technical Report 1999-66, Stanford InfoLab.
Pang, B. and Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135.
Phillips, D. M. and Baumgartner, H. (2002). The role of consumption emotions in the satisfaction response. Journal of Consumer Psychology, 12(3):243–252.
Plutchik, R. (1989). Emotion: Theory, Research, and Experience. Academic Press.
Pookulangara, S. and Koesler, K. (2011). Cultural influence on consumers' usage of social networks and its' impact on online purchase intentions. Journal of Retailing and Consumer Services, 18(4):348–354.
Postel, J. (1981). RFC 791 – Internet Protocol – DARPA Internet Program, Protocol Specification. https://www.rfc-editor.org/rfc/rfc791.txt.
Prabowo, R. and Thelwall, M. (2009). Sentiment analysis: a combined approach. Journal of Informetrics, 3(2):143–157.
Quinlan, R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
Ramanand, J., Bhavsar, K., and Pedanekar, N. (2010). Wishful thinking: Finding suggestions and 'buy' wishes from product reviews. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, CAAGET'10, pages 54–61, Los Angeles, California, USA. Association for Computational Linguistics.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850.
Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, SMUC'10, pages 37–44, Toronto, Canada. ACM.
Reese, W. (2008). Nginx: the high-performance web server and reverse proxy. Linux Journal.
Rentoumi, V., Petrakis, S., Klenner, M., Vouros, G. A., and Karkaletsis, V. (2010). United we stand: Improving sentiment analysis by joining machine learning and rule-based methods. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., and Tapias, D., editors, Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC'10, pages 1089–1094, Valletta, Malta. European Language Resources Association (ELRA).
Richins, M. L. (1997). Measuring emotions in the consumption experience. Journal of Consumer Research, 24(2):127–146.
Rosch, E. (1978). Principles of categorization. In Rosch, E. and Lloyd, B., editors, Cognition and Categorization, pages 27–48. John Wiley & Sons Inc.
Sadikov, E., Parameswaran, A. G., and Venetis, P. (2009). Blogs as predictors of movie success. In Proceedings of the Third International ICWSM Conference, ICWSM'09, pages 304–307.
Sánchez-Rada, J. F. and Iglesias, C. A. (2013). Onyx: describing emotions on the Web of data. In Proceedings of the First International Workshop on Emotion and Sentiment in Social and Expressive Media: Approaches and Perspectives from AI, volume 1096 of ESSEM'13, pages 71–82, Torino, Italy. AI*IA, Italian Association for Artificial Intelligence, CEUR-WS.
Santorini, B. (1991). Part-Of-Speech tagging guidelines for the Penn Treebank project (3rd revision, 2nd printing). Technical report, Department of Linguistics, University of Pennsylvania.
Schindler, R. and Bickart, B. (2005). Published word of mouth: referable, consumer-generated information on the Internet. Online Consumer Psychology: Understanding and Influencing Consumer Behaviour in the Virtual World, pages 35–61.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, NeMLaP'94, Manchester, UK.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27:379–423 and 623–656.
Shannon, C. E. and Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press.
Sharda, R. and Delen, D. (2006). Predicting box-office success of motion pictures with neural networks. Expert Systems with Applications, 30(2):243–254.
Shaver, P., Schwartz, J., Kirson, D., and O'Connor, C. (1987). Emotion knowledge: further exploration of a prototype approach. Journal of Personality and Social Psychology, 52(6):1061–1086.
Shearer, C. (2000). The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, 5(4):13–22.
Shinavier, J. (2010). Real-time #SemanticWeb in <= 140 chars. In Proceedings of the WWW2010 Workshop on Linked Data on the Web, WWW'10, Raleigh, North Carolina, USA.
Sidorov, G., Miranda-Jiménez, S., Viveros-Jiménez, F., Gelbukh, A., Castro-Sánchez, N., Velásquez, F., Díaz-Rangel, I., Suárez-Guerra, S., Treviño, A., and Gordon, J. (2013). Empirical study of machine learning based approach for opinion mining in tweets. In Proceedings of the 11th Mexican International Conference on Advances in Artificial Intelligence - Volume Part I, MICAI'12, pages 1–14, San Luis Potosí, Mexico. Springer-Verlag.
Sison, A. and J., F. (2005). Ethical aspects of e-commerce: data subjects and content. International Journal of Internet Marketing and Advertising, 3:5–18.
Sommerville, I. (2007). Software Engineering. International Computer Science Series. Addison-Wesley, eighth edition.
Sproat, R., Black, A. W., Chen, S., Kumar, S., Ostendorf, M., and Richards, C. (2001). Normalization of non-standard words. Computer Speech & Language, 15(3):287–333.
Sterne, J. (2010). Social Media Metrics: How to Measure and Optimize Your Marketing Investment. John Wiley & Sons.
Stone, P. J., Dunphy, D. C., Smith, M. S., and Ogilvie, D. M. (1966). The General Inquirer: A Computer Approach to Content Analysis. M.I.T. Press.
Strapparava, C. and Mihalcea, R. (2007). SemEval-2007 Task 14: Affective Text. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval'07, pages 70–74, Prague, Czech Republic. Association for Computational Linguistics.
Suárez-Figueroa, M. C., Gómez-Pérez, A., and Fernández-López, M. (2012). The NeOn methodology for ontology engineering. In Suárez-Figueroa, M. C., Gómez-Pérez, A., Motta, E., and Gangemi, A., editors, Ontology Engineering in a Networked World, chapter 2, pages 9–34. Springer.
Subramanyam, R. (2011). The relationship between social media buzz and TV ratings. http://www.nielsen.com/us/en/insights/news/2011/the-relationship-between-social-media-buzz-and-tv-ratings.html.
Taboada, M., Brooke, J., Tofiloski, M., Voll, K., and Stede, M. (2011). Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2):267–307.
Tetlock, P. C., Saar-Tsechansky, M., and Macskassy, S. (2008). More than words: Quantifying language to measure firms' fundamentals. Journal of Finance, 63(3):1437–1467.
Thayer, R. (1989). The Biopsychology of Mood and Arousal. Oxford University Press, New York, NY.
Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL'02, pages 417–424, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Valitutti, A., Strapparava, C., and Stock, O. (2004). Developing affective lexical resources. PsychNology Journal, 2(1):61–83.
van Bruggen, G. H., Antia, K. D., Jap, S. D., Reinartz, W. J., and Pallas, F. (2010). Managing marketing channel multiplicity. Journal of Service Research (JSR), 13(3):331–340.
Vaughn, R. (1986). How advertising works: A planning model revisited. Journal of Advertising Research, 26:57–66.
Vilares, D., Alonso, M., and Gómez-Rodríguez, C. (2013). Clasificación de polaridad en textos con opiniones en español mediante análisis sintáctico de dependencias. Procesamiento del Lenguaje Natural, 50(0).
Vázquez, S., Muñoz-García, O., Campanella, I., Poch, M., Fisas, B., Bel, N., and Andreu, G. (2014). A classification of user-generated content into consumer decision journey stages. Neural Networks, 56:68–81.
Wang, K., Thrasher, C., and Hsu, P. B.-J. (2011). Web Scale NLP: a case study on URL word breaking. In Proceedings of the 20th International Conference on World Wide Web, WWW'11, pages 357–366, Hyderabad, India. ACM.
Wang, X., Yu, C., and Wei, Y. (2012). Social media peer communication and impacts on purchase intentions: A consumer socialization framework. Journal of Interactive Marketing: a quarterly publication from the Direct Marketing Educational Foundation, 26(4):198–209.
Weber, L. (2007). Marketing to the social Web: how digital customer communities build your business. Wiley.
Westbrook, R. A. and Oliver, R. L. (1991). The dimensionality of consumption emotion patterns and consumer satisfaction. Journal of Consumer Research, 18(1):84–91.
Westerski, A., Iglesias, C. A., and Tapia, F. (2011). Linked opinions: describing sentiments on the structured Web of Data. In Proceedings of the 4th International Workshop on Social Data on the Web, SDoW'11, Bonn, Germany.
Wiebe, J., Wilson, T., and Cardie, C. (2005). Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2–3):165–210.
Wu, X. and He, Z. (2011). Identifying wish sentence in product reviews. Journal of Computational Information Systems, 7:1607–1613.
Yergeau, F. (2003). RFC 3629 – UTF-8, a transformation format of ISO 10646. https://tools.ietf.org/html/rfc3629.
Zhang, W. and Skiena, S. (2009). Improving movie gross prediction through news analysis. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, WI-IAT'09, pages 301–304, Washington, DC, USA. IEEE Computer Society.
Zhao, Y. and Karypis, G. (2001). Criterion functions for document clustering: experiments and analysis. Technical report, Department of Computer Science, University of Minnesota.