
ALMA MATER STUDIORUM - UNIVERSITÀ DI BOLOGNA
DISI - Dipartimento di Informatica: Scienza e Ingegneria
PhD in Computer Science and Engineering
Cycle XXXI
Competition Sector: 09/H1
Disciplinary Sector: ING-INF/05

BIG DATA MINING AND MACHINE LEARNING TECHNIQUES APPLIED TO REAL WORLD SCENARIOS

Candidate: Dott. Ing. ANDREA PAGLIARANI
Supervisor: Chiar.mo Prof. Ing. GIANLUCA MORO
Tutor: Chiar.mo Prof. Ing. CLAUDIO SARTORI
Coordinator: Chiar.mo Prof. Ing. PAOLO CIACCIA

FINAL EXAMINATION YEAR 2019

Be modest and determined,
and do what you love to do,
always.

Acknowledgements

Most of the material integrated into a single work in this thesis was first written in publications I co-authored. I am thankful to the great researchers I have collaborated with, for their guidance, their suggestions and their help. I would like to thank first Gianluca Moro, who carefully supervised me over the past three years and whose deep expertise has been pivotal for my research. Claudio Sartori, who co-supervised me and shared many essential research ideas, also deserves a special mention. I am grateful for the opportunity Claudio gave me to contribute to TOREADOR, a European project that kickstarted me in the world of research. I also wish to thank Giacomo Domeniconi and Roberto Pasolini, who shared the APICe laboratory with me and tried to improve my technical skills. Special thanks to Elizabeth Daly, who supervised me during my three-month research period at IBM Research - Ireland Lab. Finally, I want to thank these other great researchers and professionals I have collaborated with: Oznur Alkan, Adi Botea, Stefano Lodi, Beniamino Di Martino, Cesare Bandirali, Salvatore D’Angelo, Antonio Esposito, Karin Pasini.

Andrea Pagliarani, 8th February 2019

Contents

Abstract

1 About this thesis
  1.1 Structure of this thesis
  1.2 List of publications

I Background and Motivation

2 The age of big data
  2.1 What is big data?
    2.1.1 The features of big data
    2.1.2 The value of big data
    2.1.3 Open challenges
  2.2 Related technologies
    2.2.1 Cloud computing
    2.2.2 IoT
    2.2.3 Data mining and machine

3 Sentiment analysis
  3.1 Introduction
  3.2 Techniques
    3.2.1 Feature selection
    3.2.2 Sentiment classification
    3.2.3 Cross-domain and transfer learning

4 Stock market analysis
  4.1 Introduction
  4.2 Traditional approaches
  4.3 Methods based on text and news analysis
  4.4 Methods based on social media
    4.4.1 The predictive value of Twitter
    4.4.2 Twitter-based stock market prediction

5 Recommender systems for job search
  5.1 Recruitment systems
  5.2 Job recommendation systems
  5.3 Career pathway recommendation systems

6 Big Data Analytics
  6.1 Introduction
  6.2 Platforms
    6.2.1 Horizontal scaling platforms
    6.2.2 Vertical scaling platforms
  6.3 Algorithms
    6.3.1 Local learning and model fusion for multiple information sources
    6.3.2 Mining from sparse, uncertain and incomplete data
    6.3.3 Mining complex and dynamic data

II Algorithms and Methods for Sentiment Analysis

7 Markov techniques for transfer learning and sentiment analysis
  7.1 Markov chain
    7.1.1 Introduction
    7.1.2 Markov theory applications
  7.2 A Markov method for cross-domain sentiment classification
    7.2.1 Overview
    7.2.2 Text pre-processing
    7.2.3 Learning phase
    7.2.4 Classification phase
    7.2.5 Computational complexity
    7.2.6 Experimental setup
    7.2.7 Markov chain performance with different feature selection methods
    7.2.8 Comparison with the state of the art
  7.3 Variants of the Markov chain method
    7.3.1 Document splitting into sentences
    7.3.2 Polarity-driven state transitions
    7.3.3 Analysis and results
  7.4 Final remarks

8 Deep methods to enhance text understanding in big dataset analysis
  8.1 Deep learning
    8.1.1 Overview
    8.1.2 The impact on sentiment classification
  8.2 Distributed text representation for transfer learning and cross-domain
    8.2.1 Methods
    8.2.2 Experimental setup
    8.2.3 In-domain analysis
    8.2.4 Cross-domain analysis
    8.2.5 Multi-source training
    8.2.6 Final remarks
  8.3 Fine-tuning of memory-based deep neural networks for transfer learning
    8.3.1 Gated recurrent unit
    8.3.2 In-domain analysis
    8.3.3 Cross-domain analysis
    8.3.4 Fine-tuning for transfer learning and cross-domain
    8.3.5 Final remarks
  8.4 Combining memory-based deep architectures and distributed text representations for sentiment classification
    8.4.1 Differentiable neural computer
    8.4.2 Global vectors
    8.4.3 The impact of labeled training data on memory-based networks
    8.4.4 Fine-tuning for cross-domain sentiment classification
    8.4.5 Large-scale document sentiment classification
    8.4.6 Single-sentence sentiment classification
    8.4.7 Final remarks

III Text Mining Approaches for Stock Market Analysis

9 Methods for DJIA index prediction
  9.1 Background
    9.1.1 Noise detection
  9.2 Detecting and mining relevant tweets
    9.2.1 Benchmark text set
    9.2.2 Vector space model construction
    9.2.3 Detecting noisy tweets
    9.2.4 Experimental evaluation
  9.3 A trading protocol for DJIA
    9.3.1 Protocol
    9.3.2 Experiments
  9.4 Final remarks

IV Big Data Mining and Machine Learning Methods for Job Search

10 A skillset-based job recommender
  10.1 Building a hierarchy of job positions
    10.1.1 Methodology
    10.1.2 Experimental setup
    10.1.3 Job hierarchy
  10.2 Job recommendation
    10.2.1 Methodology
    10.2.2 Results
  10.3 Final remarks

11 Modeling job transitions and recommending career pathways
  11.1 Modeling job transitions
    11.1.1 Clustering of similar jobs
    11.1.2 Building a job graph
  11.2 MDP-based career pathway recommendation
    11.2.1 Recommendation requirements
    11.2.2 MDP recommender
  11.3 Experimental evaluation
    11.3.1 Evaluation protocol
    11.3.2 Results
  11.4 Final remarks

V Accelerate the Development of Big Data Analytics

12 Towards vendor-agnostic implementation of big data analytics
  12.1 Background
  12.2 Parallel programming primitives
    12.2.1 Primitives definition
  12.3 Agnostic implementations of data mining algorithms based on parallel primitives
    12.3.1 k-means
    12.3.2 C4.5
    12.3.3 Apriori
    12.3.4 PageRank
  12.4 From vendor-agnostic implementation to vendor-specific platforms
    12.4.1 Compiling primitives into skeletons
  12.5 Spark implementation of k-means based on parallel primitives
  12.6 Experimental results
    12.6.1 Dataset description
    12.6.2 Performance check
  12.7 Final remarks

VI Conclusion

13 Results achieved and future work
  13.1 Methods for sentiment analysis
  13.2 Methods for stock market analysis
  13.3 Recommendation methods for job search
  13.4 Boosting the development of big data analytics
  13.5 Ongoing and future work
    13.5.1 Improvement of the Markov techniques for sentiment analysis
    13.5.2 Enhancing deep learning approaches for sentiment analysis
    13.5.3 Expanding the investigation on stock market analysis
    13.5.4 Improving the performance of job recommendation
    13.5.5 Enhancing the effectiveness of career pathway recommendation
    13.5.6 Supporting the development of big data analytics

Bibliography


Abstract

Data mining techniques allow the extraction of valuable information from heterogeneous and possibly very large data sources, which can be either structured or unstructured. Unstructured data, such as text files, social media and mobile data, are far more abundant than structured data and grow at a higher rate. Their high volume and the inherent ambiguity of natural language make unstructured data very hard to process and analyze. Appropriate text representations are therefore required to capture word semantics while preserving statistical information, e.g. word counts. In Big Data scenarios, scalability is also a primary requirement. Data mining and machine learning approaches should take advantage of large-scale data, exploiting abundant information and avoiding the curse of dimensionality. The goal of this thesis is to enhance text understanding in the analysis of big data sets, introducing novel techniques that can be employed to solve real-world problems. The presented Markov methods temporarily achieved state-of-the-art results on well-known Amazon reviews corpora for cross-domain sentiment analysis, before being outperformed by deep approaches in the analysis of large data sets. A noise detection method for the identification of relevant tweets leads to 88.9% accuracy in the Dow Jones Industrial Average daily prediction, which is the best result in the literature based on social networks. Dimensionality reduction approaches are used in combination with LinkedIn users’ skills to perform job recommendation. A framework based o