Deep Learning on Big Data Sets in the Cloud with
Apache Spark and Google TensorFlow

Patrick GLAUNER and Radu STATE
Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg

Abstract:
Machine learning is the branch of artificial intelligence that gives computers the ability to learn patterns from data without being explicitly programmed. Deep Learning is a set of cutting-edge machine learning algorithms inspired by how the human brain works. It allows feature hierarchies to be learned from the data itself rather than relying on hand-crafted features, and it has proven to significantly improve performance in challenging data analytics problems. In this tutorial, we will first provide an introduction to the theoretical foundations of neural networks and Deep Learning. Second, we will demonstrate how to use Deep Learning in the cloud in a distributed environment for Big Data analytics. This combines Apache Spark and TensorFlow, Google’s in-house Deep Learning platform made for Big Data machine learning applications. Practical demonstrations will include character recognition and time series forecasting on Big Data sets. Attendees will be provided with code snippets that they can easily amend in order to analyze their own data. A related but shorter tutorial focusing on Deep Learning on a single computer was given at the Data Science Luxembourg Meetup in April 2016. It was attended by 70 people, making it the most attended event of this Meetup series in Luxembourg since its beginning.

1. Intended audience
This tutorial assumes no prior experience with Apache Spark, machine learning, Deep Learning or TensorFlow. Attendees will acquire both the theoretical foundations and hands-on experience. Attendees with prior experience in machine learning will benefit from a comprehensive review of the theoretical foundations. In addition, this tutorial will include advanced topics of Deep Learning, such as new regularization methods and long short-term memory (LSTM) networks, which are primarily aimed at attendees with prior experience in this domain. This part can be skipped by beginners and will not be crucial to their overall learning experience. In order to fully benefit from this tutorial, attendees should bring their own laptops. This will allow them to run the experiments on their own computers at the same time and to discuss practical questions.

2. Learning outcome
Attendees will gain an understanding of what machine learning is and how Deep Learning, its cutting-edge flavor, works. They will not only learn how a distributed environment for Big Data analytics can be deployed in the cloud, but will also experience it first-hand using Apache Spark and Google TensorFlow on real Big Data sets. This knowledge will allow them to apply these techniques and this infrastructure to their own analytics problems in the cloud.

3. Description
3.1 Motivation
Machine learning allows computers to learn from data without being explicitly programmed. However, hand-crafting features from raw data input is a major challenge in machine learning. Deep Learning allows increasingly complex feature hierarchies to be learned from the raw data input itself. Deep Learning builds on top of the theory of neural networks, which are celebrating a comeback under this new term. Deep Learning has proven to significantly outperform other learning algorithms in a variety of tasks, such as image recognition [1], speech recognition [2] or winning the game of Go [3]. However, Deep Learning is not an easy-to-use silver bullet and requires intensive training. To date, there is no comprehensive book on this topic and expertise must be painstakingly collected from many different sources. Therefore, the goal of this tutorial is to provide a comprehensive introduction to the foundations of Deep Learning. Another shortcoming of Deep Learning is the potentially long training time of a deep neural network. TensorFlow is Google’s in-house Deep Learning platform, which allows deep neural networks to be trained efficiently on GPUs. A different approach is to distribute the computation using map-reduce architectures such as Apache Spark. In this tutorial, the effectiveness of combining both will be shown on real Big Data sets, along with how to deploy this combination in the cloud.

3.2 Outline of the proposed content
The proposed structure of this tutorial is as follows:
1. This tutorial will begin with a quick introduction to the most relevant foundations of machine learning.
2. It will then provide a comprehensive introduction to neural networks, a class of learning algorithms inspired by how the human brain works. This also includes a discussion of the limitations of backpropagation, the traditional neural network training method.
3. Neural networks are the foundation of Deep Learning, which essentially refers to neural networks with many layers of neurons. In this section, Deep Learning will be presented to the audience, along with how new training methods overcome the limitations of backpropagation in order to efficiently train powerful deep neural networks.
4. Training Deep Learning architectures is time-consuming. However, training a neural network essentially boils down to a series of matrix multiplications, and matrix multiplications can be distributed efficiently. Typical distribution methods include map-reduce and training on GPUs (see the first sketch after this list).
5. Apache Spark [4] uses map-reduce in order to distribute computations among nodes. In contrast, Google TensorFlow [5] can distribute training across one or multiple GPUs. Both approaches will be presented, as well as how they can be combined to take advantage of both concepts and achieve the most efficient outcome in the cloud (see the Spark sketch after this list).
6. In the first practical demonstration, multiple deep neural networks are trained to recognize characters using the notMNIST data set [6], which consists of the characters A-J rendered in different fonts. This will include a discussion of convolutional neural networks (CNNs), which are inspired by how the human vision system works (see the TensorFlow sketch after this list).
7. In the second practical demonstration, multiple deep neural networks are trained to forecast a time series. This will include a discussion of recurrent neural networks (RNNs), which are able to process temporal information, and of long short-term memory (LSTM) networks, a modular and highly effective type of RNN (see the LSTM sketch after this list). Furthermore, advanced time series forecasting, such as electricity load forecasting using a data set of the “Global Energy Forecasting Competition 2012 – Load Forecasting” Kaggle challenge [7], will be discussed.
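
To make item 4 concrete, the following minimal NumPy sketch (illustrative only, not the tutorial’s code; all shapes are arbitrary) shows that the forward pass of a small two-layer network is just two matrix multiplications plus element-wise non-linearities, which is exactly the structure that map-reduce and GPUs can distribute:

import numpy as np

rng = np.random.RandomState(0)

X = rng.rand(64, 100)             # a batch of 64 examples with 100 features each
W1 = 0.1 * rng.randn(100, 32)     # weights of the hidden layer
W2 = 0.1 * rng.randn(32, 10)      # weights of the output layer

hidden = np.maximum(0.0, X @ W1)  # first matrix multiplication, followed by ReLU
logits = hidden @ W2              # second matrix multiplication
print(logits.shape)               # (64, 10): one score per class per example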
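
A minimal PySpark sketch of the map-reduce pattern from item 5 (the application name, the partition count and the toy data standing in for a real Big Data set are assumptions for the example):

from pyspark import SparkContext

sc = SparkContext(appName="MapReduceDemo")

# Distribute the data set across 8 partitions; on a cluster these live on different nodes.
numbers = sc.parallelize(range(1, 1001), 8)

# Map: square every element in parallel. Reduce: combine the partial results into one sum.
total = numbers.map(lambda v: v * v).reduce(lambda a, b: a + b)

print(total)  # 333833500
sc.stop()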
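
For item 6, a minimal TensorFlow sketch of a single convolutional layer followed by a linear classifier, written against the TensorFlow 1.x API (details may differ in other versions); the filter sizes, layer widths and the random stand-in batch are illustrative assumptions, and in the tutorial the input would come from notMNIST:

import numpy as np
import tensorflow as tf

# A batch of 28x28 grayscale glyphs and their one-hot labels for the 10 classes A-J.
x = tf.placeholder(tf.float32, [None, 28, 28, 1])
y_ = tf.placeholder(tf.float32, [None, 10])

# Convolution: 16 filters of size 5x5, then ReLU and 2x2 max-pooling.
W_conv = tf.Variable(tf.truncated_normal([5, 5, 1, 16], stddev=0.1))
b_conv = tf.Variable(tf.zeros([16]))
conv = tf.nn.relu(tf.nn.conv2d(x, W_conv, strides=[1, 1, 1, 1], padding='SAME') + b_conv)
pool = tf.nn.max_pool(conv, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# Flatten the 14x14x16 feature maps and map them to the 10 classes.
flat = tf.reshape(pool, [-1, 14 * 14 * 16])
W_fc = tf.Variable(tf.truncated_normal([14 * 14 * 16, 10], stddev=0.1))
b_fc = tf.Variable(tf.zeros([10]))
logits = tf.matmul(flat, W_fc) + b_fc

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # One training step on random stand-in data; real batches would come from notMNIST.
    batch_x = np.random.rand(32, 28, 28, 1).astype(np.float32)
    batch_y = np.eye(10)[np.random.randint(10, size=32)].astype(np.float32)
    _, l = sess.run([train_step, loss], {x: batch_x, y_: batch_y})
    print('loss after one step:', l)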
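
For item 7, a minimal NumPy sketch of one step of the standard LSTM equations, to illustrate the gating mechanism discussed in the tutorial (an illustration only; the weight shapes and the toy sequence are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # W has shape (len(h_prev) + len(x), 4 * len(h_prev)): all four gates at once.
    z = np.concatenate([h_prev, x]) @ W + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget and output gates
    g = np.tanh(g)                                # candidate cell state
    c = f * c_prev + i * g                        # keep part of the old state, add new content
    h = o * np.tanh(c)                            # expose a filtered view of the cell state
    return h, c

# Toy usage: hidden size 3, input size 2, a random input sequence of length 4.
rng = np.random.RandomState(0)
h, c = np.zeros(3), np.zeros(3)
W, b = 0.1 * rng.randn(5, 12), np.zeros(12)
for x in rng.randn(4, 2):
    h, c = lstm_step(x, h, c, W, b)
print(h)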

4. Prior tutorials
A related tutorial was given at the Data Science Luxembourg Meetup [8] in April 2016 under the title “Deep Learning with TensorFlow” [9]. That 1-hour tutorial assumed expertise in machine learning and focused on the theoretical foundations of Deep Learning and on how to apply regular deep feed-forward neural networks in TensorFlow to the notMNIST data set for character recognition. It was attended by approximately 70 people, who asked many questions and whose feedback was consistently positive. This was the most popular Data Science Luxembourg Meetup event since this monthly Meetup series started in November 2012. A tutorial on Deep Learning for load forecasting [10] was accepted at IEEE PES Innovative Smart Grid Technologies (ISGT) Europe [11] and will be given in October 2016. However, its focus will be on Deep Learning on a single computer using TensorFlow for time series forecasting. This 3-hour tutorial will be different in the following ways:

  • Use of Apache Spark combined with TensorFlow, taking advantage of a distributed
    environment in order to efficiently process Big Data sets in the cloud.
  • The length of this tutorial allows CNNs to be covered as well, and not just regular
    feed-forward architectures, for the image recognition example.
  • In the time series example, it will not only include RNNs but also provide a comparison to other state-of-the-art models such as Hidden Markov Models.
  • Prior machine learning experience will not be assumed and the theoretical foundations will be covered in the beginning.
  • It will focus on Deep Learning and skip the last part on the future of artificial intelligence and the technological singularity.

5. Materials
A comprehensive tutorial slide deck will be provided, which contains figures, definitions, explanations, relevant parts of the code snippets and an annotated bibliography. The complete and functional code snippets will also be provided. In order to make them work, a list of required library dependencies will be provided as well, so that attendees can easily install them. All code snippets can be deployed in the cloud to speed up training time.

6. Bio sketch
Patrick GLAUNER graduated as valedictorian with a B.Sc. degree in computer science from Karlsruhe University of Applied Sciences in 2012 and received the M.Sc. degree in machine learning from Imperial College London in 2015. He was a Fellow at CERN, the European Organization for Nuclear Research, worked at SAP and is an alumnus of the German National Academic Foundation (Studienstiftung des deutschen Volkes). He is currently a Ph.D. student in machine learning at the Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, under the supervision of Dr. Radu STATE. He also holds an adjunct faculty appointment at Karlsruhe University of Applied Sciences, where he teaches artificial intelligence. His interests include anomaly detection, big data, computer vision, deep learning and time series.

Radu STATE received the M.Sc. degree from Johns Hopkins University, Baltimore, MD, USA, and the Ph.D. degree and an HDR from the University of Lorraine, Nancy, France. He is a Senior Researcher with the Interdisciplinary Centre for Security, Reliability and Trust in Luxembourg, where he heads the SEDAN research group. He was a Professor at the University of Lorraine and a Senior Researcher at INRIA Nancy – Grand Est. Having authored more than 100 papers, his research interests cover network and system security and management.

Survey Analytics from Questionnaires and Textual Social
Media Analytics, with Accompanying Practical Sessions,
Examples and Case Studies in English

Prof Fionn Murtagh and Dr Mohsen Farid
Big Data Labs, University of Derby, United Kingdom

1. Course Description
The work of the celebrated social scientist Pierre Bourdieu (1930-2002) includes the thoughtful and creative use of Correspondence Analysis, most notably in the work published in English in 1984 under the title Distinction. It is on such a geometric data analysis approach that this course is based. The focus is on: (1) interpretation of results, graphical displays and other outputs; (2) practical implementation using the R statistical and visualization environment; and (3) providing intuition, and full understanding, of the underlying geometry and statistical processing. We use data collected in various questionnaires, starting from Bourdieu’s work on cultural taste. Other questionnaire analysis case studies will relate to transport, cooking and lifestyle, student experience, consumer behavior, and music appreciation. Next, questionnaires whose outcomes comprise both closed, fixed-format questions and free-text responses will be analyzed conjointly. Finally, data sourced from social media micro-blogging, i.e. Twitter, will be studied. Data sources: questionnaire numerical scoring responses, free-text responses, and Twitter data.
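
The practical work in this course is carried out in R. Purely as an illustration in Python of the computation at the heart of this geometric approach, the following minimal NumPy sketch performs a Correspondence Analysis of a small, entirely hypothetical contingency table via the singular value decomposition of the standardized residuals:

import numpy as np

# Hypothetical contingency table: rows are taste categories, columns are social groups.
N = np.array([[30., 10.,  5.],
              [12., 20.,  8.],
              [ 4.,  9., 22.]])

P = N / N.sum()                                     # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)                 # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, d, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of the row profiles: the planar maps of such analyses
# are read from the first two columns (the first two factorial axes).
rows = (U * d) / np.sqrt(r)[:, None]
print(rows[:, :2])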

2. Syllabus
Tools
The course uses the R programming and visualization language.
Topics
In accompanying online course materials, there will be a practical introduction to the R language and environment. This is for participants who have not used R before.
Part 1: Questionnaire analysis case study: taking the Bourdieu taste data, with a detailed discussion of the output and of the R code used.
Part 2: Geometric intuition: the methodology used for graphical display, hierarchical clustering, and putting it all together.
Part 3: Carrying out geometric data analysis, including clustering, using R. Including publication/presentation outputs, storing data for later work, and maintaining the R scripts that are used.
Part 4: Further case studies of questionnaire analysis.
Part 5: Questionnaire analysis, using conjoint, or integrally related, analysis of closed questions, and open or free text questions.
Part 6: Coverage of social media data sources will be centered especially on Twitter. All sessions will be associated with practical exercises, using case studies.
Final Part: Concluding short debate and discussion on potential and scope for analytics, and statistical treatment of data, and text mining.

3. Target Audience
Practitioners and researchers in any of the domains encompassed by the case studies and practical exercises, as well as students who are undertaking, or who are planning to undertake, any such work. Domains of general relevance include:

  • Health and medical surveys,
  • Marketing,
  • Security and forensics,
  • Information and data sourcing through web-based questionnaires,
  • Lifestyle and wellbeing analytics,
  • Legal studies,
  • Political studies,
  • Language and literature,
  • Digital humanities.

The presentation language of the short course is English. Case studies will be in English as well; however, issues related to other languages, such as Arabic, may be addressed.

4. Facilities Required

  • Classrooms equipped with a computer (with the complete software environment) connected to an overhead projector and screen, plus a writing board.
  • Computers for participants. Course participants’ own laptops are also feasible (with the complete software environment).
  • Software: R, open source and openly available for all computer platforms, with pertinent packages as required.
  • Course Material: All course materials, including the data and examples of software use for the case studies, will be made available for course participants, on a password protected web site.

Use Amazon Elastic MapReduce to Process Big Data

Jiming Wu
Associate Professor, California State University, East Bay

Abstract:
This tutorial teaches the audience how to use Amazon Elastic MapReduce (Amazon EMR) to analyze large amounts of data. Amazon EMR is a web service that provides a managed Hadoop framework to simplify big data processing. Topics will include 1) creating an Amazon Web Services account, 2) employing Amazon’s cloud storage service, 3) running an EMR cluster, 4) setting up an EMR job, and 5) examining EMR job output. Intended audience: graduate students with a concentration in business analytics, data analytics, or data science. Learning outcome: the audience will be able to use Amazon EMR to process Big Data.

Description: This is an introduction to the Amazon Elastic MapReduce system. Topics include MapReduce features, the Hadoop distributed filesystem, input/output, the Amazon storage system, and EMR clusters. Students will have the opportunity to use the Amazon MapReduce system to process Big Data. The objective of this tutorial is to impart working knowledge and skills associated with Big Data technologies and to let students better understand how companies leverage these technologies to analyze Big Data.

Outline of the content: 1) learn how to create an Amazon Web Services account, 2) discuss how to employ Amazon’s cloud storage service, 3) explain how to create and run an EMR cluster, 4) describe how to set up an EMR job, and 5) show how to access and interpret EMR job output. A hedged programmatic sketch of steps 3) and 4) follows.
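
As an illustrative companion to the console-based walkthrough below, the following Python sketch uses boto3, the AWS SDK for Python, to launch an EMR cluster and submit a custom JAR step programmatically; the region, instance types and counts, EMR release label and S3 paths are assumptions for the example, not values prescribed by this tutorial:

import boto3  # assumes AWS credentials are already configured, e.g. via `aws configure`

emr = boto3.client('emr', region_name='us-east-1')

# Launch a small managed Hadoop cluster and submit the custom JAR as a step.
response = emr.run_job_flow(
    Name='max-temperature-demo',
    ReleaseLabel='emr-5.0.0',
    Instances={
        'MasterInstanceType': 'm4.large',
        'SlaveInstanceType': 'm4.large',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': False,  # terminate once the step finishes
    },
    Steps=[{
        'Name': 'MaxTemperature',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 's3://chapter2/max-temperature.jar',
            'Args': ['MaxTemperature', 's3://chapter2/input', 's3://chapter2/output'],
        },
    }],
    JobFlowRole='EMR_EC2_DefaultRole',  # default EMR instance profile
    ServiceRole='EMR_DefaultRole',      # default EMR service role
)
print(response['JobFlowId'])  # the id of the newly launched cluster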

An example of finding the maximum temperature:
1. Set up a Hadoop cluster on Amazon Elastic MapReduce (EMR)
2. Submit max-temperature.jar to EMR. Please refer to the following website about how to submit a custom JAR: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-launch-customjarcli.html
3. Set up the input file folder on the Amazon storage service, S3
4. In the Amazon console, set up the JAR (max-temperature.jar) and then set up its arguments:
MaxTemperature
s3://chapter2/input
s3://chapter2/output
5. Run the JAR on Amazon Elastic MapReduce
6. In the Amazon SSH command interface: the local drive folder is /home/hadoop; the HDFS folder is /user/hadoop
7. Copy a file from S3 to the local disk: aws s3 cp s3://chapter2/MaxTemperatureMapper.java .
8. Copy a file from the local disk to HDFS: hdfs dfs -copyFromLocal file.gz .
9. Compile the Java files: javac -cp src/:hadoop-common-2.6.1.jar:hadoop-mapreduce-client-core-2.6.1.jar:commons-cli-2.0.jar -d . MaxTemperature.java MaxTemperatureReducer.java MaxTemperatureMapper.java
10. Create a JAR file: jar -cvf max-temperature.jar MaxTemperature*.class
11. Run the JAR file: hadoop jar max-temperature.jar MaxTemperature /user/hadoop/input0/sample.txt /user/hadoop/output01
12. Display the output on screen: hadoop fs -cat /user/hadoop/output01/part-r-00001
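
The job above is packaged as a Java JAR. As a hedged alternative sketch, the same maximum-temperature computation can be expressed in Python and run with Hadoop Streaming, which ships with EMR’s Hadoop distribution; the simple input format (year and temperature as two whitespace-separated columns) is an assumption for illustration:

#!/usr/bin/env python
# max_temp_streaming.py: run with argument "map" as the mapper, "reduce" as the reducer.
import sys

def mapper():
    # Emit (year, temperature) pairs; skip malformed lines.
    for line in sys.stdin:
        parts = line.split()
        if len(parts) >= 2:
            print('%s\t%s' % (parts[0], parts[1]))

def reducer():
    # Streaming sorts mapper output by key, so all records for a year arrive together.
    current_year, max_temp = None, None
    for line in sys.stdin:
        year, temp = line.strip().split('\t')
        temp = int(temp)
        if year != current_year:
            if current_year is not None:
                print('%s\t%d' % (current_year, max_temp))
            current_year, max_temp = year, temp
        else:
            max_temp = max(max_temp, temp)
    if current_year is not None:
        print('%s\t%d' % (current_year, max_temp))

if __name__ == '__main__':
    mapper() if sys.argv[1] == 'map' else reducer()

The script would be passed to Hadoop Streaming via its -mapper and -reducer options, e.g. -mapper "python max_temp_streaming.py map" and -reducer "python max_temp_streaming.py reduce".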

Statement: This tutorial has never been given before.
Materials: PowerPoint slides and Word documents will be provided to attendees.
Bio-sketch: Jiming Wu is an Associate Professor in the Department of Management at California State University, East Bay. He received his B.S. from Shanghai Jiao Tong University, M.S. from Texas Tech University, and Ph.D. from the University of Kentucky. His research interests include knowledge management, IT adoption and acceptance, and computer and network security. His work has appeared in MIS Quarterly, Journal of the Association for Information Systems, European Journal of Information Systems, Information & Management, Decision Support Systems, and elsewhere.