
Processing Stack Overflow data dump with Apache Spark

This post is about the final work I did for one of the courses of the Master's degree I'm currently attending at UFRJ (Federal University of Rio de Janeiro), in the branch of Data and Knowledge Engineering (Databases), under the Computer and Systems Engineering department at COPPE/UFRJ.

The course is called Special Topics in Databases IV and is taught by professor Alexandre Bento de Assis Lima.

The presentation (PPT slides) is in Brazilian Portuguese. I'll translate the slides to English in this blog post. They give an overall view of the work done.

The final paper is written in English.

Files

Trabalho prático sobre Apache Spark envolvendo um problema típico de Big Data (apresentação/presentation).pdf (in Portuguese)

Processing Stack Overflow data dump with Apache Spark (in English)

Abstract. This paper describes the process involved in building an ETL tool based on Apache Spark. It imports XML data from the Stack Overflow data dump.
The XML files are processed using the Spark XML library and converted to a DataFrame object. The DataFrame data is then queried with the Spark SQL library.
Two applications were developed: spark-backend and spark-frontend. The first one contains the code responsible for dealing with Spark, while the latter is user-centric, allowing users to consume the data processed by Spark.

All the code developed is in English and should be easy to read.

Presentation
  1. Objective
  2. Problem
  3. Technologies
  4. Strategy used to acquire the data
  5. Development
  6. Conclusion
  7. Links
  1. Objective
    • Put into practice the concepts presented during the classes.
    • Have a closer contact with modern technologies used to process Big Data.
    • Automate the extraction/mining of valuable/interesting information hidden in the immensity of data.

  2. Problem
    • Analyse the Stack Overflow data dump that is made available on the internet on a monthly basis.
    • The data dump is composed of a set of XML files compressed in the .7z format.
    • Even compressed, the biggest file is 15.3 GB. This size is directly linked to the data volumes handled in Big Data.
    • At first, Spark will be used as an ETL (Extract > Transform > Load) tool to prepare the data consumed by a front-end web app.
    • "At first" because there's also the possibility of using Spark as a tool to process the data that'll be shown in the web app.

  3. Technologies
    • Apache Spark 2.0.1+
    • Spark XML 0.4.1+
    • Spark SQL 2.0.2
    • Ubuntu 16.04 LTS (Xenial Xerus)
    • Linux VM (virtual machine) running on Parallels Desktop 12 for Mac
    • Scala 2.11.8
    • XML (Extensible Markup Language)
    • XSL (Extensible Stylesheet Language)
    • Play Framework 2.5 (front end)
    • Eclipse Neon 4.6.1 with Scala IDE 4.5.0 plugin as the IDE

  4. Strategy used to acquire the data
    • Got the .torrent file that contains all the data dumps from the Stack Exchange family of sites - https://archive.org/details/stackexchange
    • Selected the eight .7z files related to StackOverflow: stackoverflow.com-Badges.7z, stackoverflow.com-Comments.7z, stackoverflow.com-PostHistory.7z, stackoverflow.com-PostLinks.7z, stackoverflow.com-Posts.7z, stackoverflow.com-Tags.7z, stackoverflow.com-Users.7z, stackoverflow.com-Votes.7z

  5. Development
    • To make the work viable (running locally, outside a cluster), a single .xml file [Users.xml] was used. A subset of 100,000 lines (32.7 MB) was selected. The full file has a total of 5,987,287 lines (1.8 GB).
    • hadoop@ubuntu:/media/psf/FreeAgent GoFlex Drive/Downloads$ head -100000 Users.xml > Users100000.xml
    • The file Users.xsl was used to convert the Users100000.xml data to the format expected by the spark-xml library. The result was saved to Users100000.out.xml.


    • The .xml and .xsl files were placed into the input folder of the Scala project [spark-backend] inside Eclipse.
    • The application spark-backend reads the file Users100000.out.xml through Spark XML and transforms it into a DataFrame object.
    • The Spark SQL library is then used to query the data. Some sample queries were created.
    • Each query generates a CSV file (SaveDfToCsv) to be consumed in a later stage by a web application [spark-frontend]; that is, Spark is used as an ETL tool.
    • The result of each query is saved as multiple files in the output folder. This happens because Spark was conceived to execute jobs in a cluster (multiple nodes/computers).
    • For testing purposes, a method that renames the CSV file was created. This method copies the generated CSV to a folder called csv. The destination folder can be configured in the file conf/spark-backend.properties. A sketch of this flow follows below.
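
For illustration, here is a minimal sketch of what this read > query > save flow could look like. It is an assumption-based sketch, not the project's actual code: the sample query and the column names (DisplayName, Reputation) are hypothetical and depend on how Users.xsl reshaped the data.

    import org.apache.spark.sql.SparkSession

    object ProcessUsersXmlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("spark-backend")
          .master("local[*]") // local mode while developing outside a cluster
          .getOrCreate()

        // spark-xml maps each element named by rowTag to a DataFrame row.
        val users = spark.read
          .format("com.databricks.spark.xml")
          .option("rowTag", "row")
          .load("input/Users100000.out.xml")

        // Register the DataFrame so it can be queried with plain SQL.
        users.createOrReplaceTempView("users")

        // Hypothetical sample query: top 100 users by reputation.
        val top = spark.sql(
          """SELECT DisplayName, Reputation
            |FROM users
            |ORDER BY Reputation DESC
            |LIMIT 100""".stripMargin)

        // coalesce(1) forces a single part file, which makes the
        // rename-and-copy-to-csv-folder step described above simpler.
        top.coalesce(1)
          .write
          .option("header", "true")
          .csv("output/top-users-by-reputation")

        spark.stop()
      }
    }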



    • The application [spark-backend] can be executed inside Eclipse or from the command line in Terminal using the spark-submit command.
    • On the command line we use the JAR file produced by the project build in Eclipse, passing the necessary packages as parameters:
    • spark-submit --packages com.databricks:spark-xml_2.11:0.4.1 --class com.lenielmacaferi.spark.ProcessUsersXml --master local com.lenielmacaferi.spark-backend-0.0.1-SNAPSHOT.jar
    • The application [spark-frontend] was built with Play Framework (The High Velocity Web Framework for Java and Scala).
    • The user opens the spark-frontend main page at localhost:9000 and has access to the list of CSV files generated by the [spark-backend] application.
    • When a file name is clicked, the CSV file is sent to the user's computer. The user can then use any spreadsheet software to open and post-process/analyse/massage the data. A controller sketch follows below.
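
A minimal sketch of what the download side of [spark-frontend] could look like in Play 2.5. The controller, folder, and route names are assumptions for illustration, not the project's actual code:

    package controllers

    import java.io.File
    import play.api.mvc._

    // Hypothetical controller: lists the CSV files produced by
    // spark-backend and streams a chosen one to the browser.
    class CsvController extends Controller {

      private val csvDir = new File("csv") // folder populated by spark-backend

      // GET / -> show the names of the available CSV files
      def index = Action {
        val names = Option(csvDir.listFiles)
          .getOrElse(Array.empty[File])
          .filter(_.getName.endsWith(".csv"))
          .map(_.getName)
        Ok(names.mkString("\n"))
      }

      // GET /download/:name -> send the CSV file as an attachment
      def download(name: String) = Action {
        val file = new File(csvDir, name)
        if (file.exists) Ok.sendFile(file, inline = false)
        else NotFound(s"No such file: $name")
      }
    }

The two actions would be wired up in conf/routes as usual; in the real application the index page renders an HTML list instead of plain text.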


  6. Conclusion
    • With Spark's help we can develop interesting solutions, for example a daily job that downloads data into an "input" folder and processes it along the way in many different ways.
    • Using custom-made code we can work with the data in a cluster (fast processing) through a rich API full of methods and resources. In addition, we have at our disposal countless additional libraries/plugins developed by the community, on top of all the power of Scala and Java and their accompanying libraries.
    • The application demonstrated can easily be executed in a cluster; we only need to change some parameters in the SparkConf object, as the sketch below illustrates.
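
As a hedged illustration (the cluster master URL below is a placeholder, not real infrastructure), the change amounts to something like:

    import org.apache.spark.SparkConf

    object ConfSketch {
      // Development: run locally using all available cores.
      val localConf = new SparkConf()
        .setAppName("spark-backend")
        .setMaster("local[*]")

      // Cluster: point the same application at a Spark standalone master.
      // Host name and port are placeholders.
      val clusterConf = new SparkConf()
        .setAppName("spark-backend")
        .setMaster("spark://master-host:7077")
    }

In practice the master is often left out of the code entirely and supplied via spark-submit's --master flag, so the same JAR runs unchanged locally and on a cluster.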

  7. Links

Installing PHP on Mac OS X Snow Leopard 10.6.5

Motivated by this question at StackOverflow: RegExp PHP get text between multiple span tags, I decided to help.

Recently I got a Mac mini. I still hadn't played with PHP on Mac, and to debug my answer to that question I needed a way to test the code. So I thought: why not also give PHP a try on Mac OS, since it's my main OS today? Oh, good idea, go learn something new… :D

The first thing I did, obviously, was turn to Google and search for something that could help me get there.

I found a pretty good, to-the-point tutorial on enabling PHP on a Mac at About.com, written by Angela Bradley: How to Install PHP on a Mac. Along the way I had to solve only one minor thing, described in the caveat section at the end of this post.

You see that the title of this post has the word installing (well, I thought I had to install it; that was my first reaction), but in fact it could be the word enabling, because PHP is an integral part of Mac OS X Snow Leopard and we just need to enable it, as you'll see soon.

Here we go. Follow these steps:

1 - Enabling the Web Server
PHP works hand in hand with a web server. Mac OS already comes with the Apache web server, so we just need to enable it. To do so, open System Preferences in the Dock, then click the 'Sharing' icon in the Internet & Network section and check the 'Web Sharing' box.

Figure 1 - Web Sharing option under the Sharing configuration in System Preferences

Now type this address in your browser: http://localhost/. You should get a message that reads: It works!

2 - Enabling PHP
PHP also comes bundled with Mac OS, but it's disabled by default. To enable it we need to edit a hidden system file located at /private/etc/apache2/httpd.conf.

I used the BBEdit text editor to edit the file (note that I marked the Show hidden items box in the screenshot below):

Figure 2 - Editing the hidden system file httpd.conf with BBEdit

Now within that file search for:

libexec/apache2/libphp5.so

Delete the # character at the start of the line so that the entire line now reads:

LoadModule php5_module          libexec/apache2/libphp5.so

Save the file.

3 - Testing the installation
There's nothing better for testing the PHP installation than using its own information. To accomplish this, write a simple one-liner .php file named test.php with this content:

<?php phpinfo() ?>

Place this file inside your personal website folder. Mine is located in this path:

/Users/leniel/Sites/test.php

Now let’s test this page by typing its address in the browser:

http://192.168.1.103/~leniel/test.php

As you see, the address points to my personal website as seen in Figure 1.

When you run this simple .php page you should get something like this:

Figure 3 - Testing PHP installation with its own configuration's information

If your PHP installation is OK, the test page will display PHP's configuration information. If it displays the page code instead, you need to restart the Apache server.

You can restart Apache by entering the following in Terminal:

sudo apachectl restart

Try to reload the page again. Everything should work.

Well done! Now you can play with PHP on your Mac and write some neat codez. :)

Caveat
When I tried to restart the Apache server I got the following error:

ulimit: open files: cannot modify limit: Invalid argument

I resorted to this solution: Mac OS X 10.6.5 broke my apachectl