Leniel Maccaferri's blog: StackOverflow


Processing Stack Overflow data dump with Apache Spark

Posted by Leniel Maccaferri on 7/04/2017 01:55:00 PM

This post is about the final project I did for one of the courses of the Master's degree I'm currently pursuing at UFRJ - Federal University of Rio de Janeiro, in the Data and Knowledge Engineering (Databases) branch of the Computer and Systems Engineering department at COPPE\UFRJ.

The discipline is called Special Topics in Databases IV and is taught by professor Alexandre Bento de Assis Lima.

The presentation (PPT slides) is in Brazilian Portuguese. I'll translate the slides to English in this blog post. They give an overview of the work done.

The final paper is written in English.

Files

- Trabalho prático sobre Apache Spark envolvendo um problema típico de Big Data (apresentação\presentation).pdf (in Portuguese)

- Processing Stack Overflow data dump with Apache Spark (in English)

Abstract. This paper describes the process involved in building an ETL tool based on Apache Spark. It imports XML data from Stack Overflow data dump.
The XML files are processed using Spark XML library and converted to a DataFrame object. The DataFrame data is then queried with Spark SQL library.
Two applications were developed: spark-backend and spark-frontend. The first one contains the code responsible for dealing with Spark, while the latter is user-centric, allowing users to consume the data processed by Spark.

All the code developed is in English and should be easy to read.

Presentation

Objective
Problem
Technologies
Strategy used to acquire the data
Development
Conclusion
Links

Objective

Put into practice the concepts presented during the classes.
Have a closer contact with modern technologies used to process Big Data.
Automate the Extraction\Mining of valuable\interesting information hidden in the immensity of data.

Problem

Analyse StackOverflow data dump available on the internet on a monthly basis.
The data dump is composed of a set of XML files compressed in the .7z format.
Even after compression, the biggest file is 15.3 GB, a data volume that already puts the problem in Big Data territory.
Spark at first will be used as an ETL tool (ETL = Extract > Transform > Load) to prepare the data consumed by a front-end web app.
"At first" because there's also the possibility of using Spark as a tool to process the data that'll be shown in the web app.

Technologies

Apache Spark 2.0.1 +
Spark XML 0.4.1 +
Spark SQL 2.0.2
Ubuntu 16.04 LTS (Xenial Xerus)
Linux VM (virtual machine) running on Parallels Desktop 12 for Mac
Scala 2.11.8
XML (Extensible Markup Language)
XSL (Extensible Stylesheet Language)
Play Framework 2.5 (front end)
Eclipse Neon 4.6.1 with Scala IDE 4.5.0 plugin as the IDE

Strategy used to acquire the data

Got the .torrent file that contains all the data dumps from Stack Exchange family of sites - https://archive.org/details/stackexchange
Selected the eight .7z files related to StackOverflow: stackoverflow.com-Badges.7z, stackoverflow.com-Comments.7z, stackoverflow.com-PostHistory.7z, stackoverflow.com-PostLinks.7z, stackoverflow.com-Posts.7z, stackoverflow.com-Tags.7z, stackoverflow.com-Users.7z, stackoverflow.com-Votes.7z

Development

To make the work viable (running locally, outside a cluster), a single .xml file [Users.xml] was used. This file has a total of 5,987,287 lines (1.8 GB), so a subset of 100,000 lines (32.7 MB) was selected.

hadoop@ubuntu:/media/psf/FreeAgent GoFlex Drive/Downloads$ head -100000 Users.xml > Users100000.xml

The file Users.xsl was used to convert the Users100000.xml data to the format expected by the spark-xml library. The result was saved to Users100000.out.xml.
The .xml and .xsl files were placed into the input folder of the Scala project [spark-backend] inside Eclipse.
The application spark-backend reads the file Users100000.out.xml through Spark XML and transforms it into a DataFrame object.
The Spark SQL library is used subsequently to search the data. Some sample queries were created.
Each query generates a CSV file (SaveDfToCsv) to be consumed in a later stage by a web application [spark-frontend], that is, Spark is used as an ETL tool.
The result of each query is saved in multiple files in the folder output. This happens because Spark was conceived to execute jobs in a cluster (multiple nodes\computers).
For testing purposes, a method that renames the CSV file was created. This method copies the generated CSV to a folder called csv. The destination folder can be configured in the file conf/spark-backend.properties.
The application [spark-backend] can be executed inside Eclipse or through the command line in Terminal using the command spark-submit.
In the command line we make use of the JAR file produced during the project build in Eclipse. We pass as parameters the necessary packages as below:

spark-submit --packages com.databricks:spark-xml_2.11:0.4.1 --class com.lenielmacaferi.spark.ProcessUsersXml --master local com.lenielmacaferi.spark-backend-0.0.1-SNAPSHOT.jar

The application [spark-frontend] was built with Play Framework (The High Velocity Web Framework For Java and Scala).
The user opens spark-frontend main page at localhost:9000 and has access to the list of CSV files generated by [spark-backend] application.
When clicking a file name, the CSV file is sent to the user's computer. The user can then use any spreadsheet software to open and post-process\analyse\massage the data.
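
The copy step mentioned above (Spark writes each query result as part files in the output folder, and a helper copies the CSV to the csv folder under a friendlier name) can be sketched as follows. This is only an illustration written in JavaScript for Node.js, not the project's Scala code. The function name copySparkCsv, the "part-" file name prefix, the folder layout and the sample data are all assumptions made for the example, and it assumes the query result was written as a single part file.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper (not taken from spark-backend): finds the single
// part-*.csv file Spark left in outputDir and copies it to csvDir under
// targetName. Assumes part files are named with a "part-" prefix.
function copySparkCsv(outputDir, csvDir, targetName) {
    const parts = fs.readdirSync(outputDir)
        .filter((name) => name.startsWith("part-") && name.endsWith(".csv"));
    if (parts.length !== 1) {
        throw new Error("Expected exactly one part file, found " + parts.length);
    }
    fs.mkdirSync(csvDir, { recursive: true });
    const target = path.join(csvDir, targetName);
    fs.copyFileSync(path.join(outputDir, parts[0]), target);
    return target;
}

// Demo with made-up data in a temporary folder, so nothing real is touched.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "spark-demo-"));
const outputDir = path.join(root, "output");
fs.mkdirSync(outputDir);
fs.writeFileSync(path.join(outputDir, "part-00000.csv"), "id,name\n1,example\n");
fs.writeFileSync(path.join(outputDir, "_SUCCESS"), ""); // marker file, ignored

const copied = copySparkCsv(outputDir, path.join(root, "csv"), "users.csv");
console.log(path.basename(copied));
```

Failing loudly when more than one part file is found is a deliberate choice in this sketch: silently picking one would hide the fact that the result is split across several files.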

Conclusion

With Spark's help we can develop interesting solutions, for example: a daily job that downloads data, uploads it to an "input" folder and processes it along the way in many different ways.
Using custom-made code we can work with the data in a cluster (fast processing) using a rich API full of methods and resources. In addition, we have at our disposal numerous additional libraries\plugins developed by the developer community. Add to that all the power of Scala and Java and their accompanying libraries.
The application demonstrated can be easily executed in a cluster. We only need to change some parameters in the object SparkConf.

Links

Source code @ GitHub - https://github.com/leniel/SparkEclipse
Apache Spark - https://spark.apache.org/
Spark XML - https://github.com/databricks/spark-xml
Spark SQL - http://spark.apache.org/sql/
Scala - https://www.scala-lang.org/
Ubuntu - https://www.ubuntu.com/
Eclipse - https://eclipse.org/

Crowdsourcing & Stack Overflow: the power of the Crowd

Posted by Leniel Maccaferri on 2/01/2017 11:22:00 PM

It's been 2 years since I last posted. Wow! Time flies and life changes...

In 2016 I started a Master's degree at UFRJ - Federal University of Rio de Janeiro, in the Data and Knowledge Engineering (Databases) branch of the Computer and Systems Engineering department at COPPE\UFRJ.

Some background history: I tried to join this same master's program back in 2010. I posted about it here: Masters degree application essay UFRJ 2011.1. At that time I was not accepted into the program because the professor responsible for the Software Engineering branch, who interviewed me, told me that I didn't have a research profile. I was more of an industry guy. I was upset, of course, when my application got rejected, but I tried again in 2015 and guess what: I applied for the Software Engineering branch again and for the Databases branch too. I passed the tests for both branches. The same professor interviewed me for Software Engineering after 5 years. She remembered me and said the same thing. She denied my application again. Thank God the Databases branch professors were more receptive and accepted my enrollment. I love both areas, and Databases involves everything related to data, which attracts my attention as well.

Lesson: don't give up on your dreams... if you want it with all your heart, go for it! It can take some time but it'll happen.

Back to the post title... as part of the grade, most of the 8 required courses have a final paper to be written. This post is about the paper I wrote for CSCW - Computer Supported Cooperative Work or even Computer Aided Collaborative Work.

I hope it can shed some light regarding the cooperative work put together by the Stack Overflow developer community with the help of our mainstream computers and mobile devices.

Abstract—Crowdsourcing gained attention in the past few years as a means to disseminate work to a crowd of people. People that can be scattered all over the world. The Internet removed all the barriers. Advances in Information Technologies and hardware allowed the development of tools that aid in the division of the work and tasks that need to be carried out. We are now in a stage where what once was supposed to be inviable is proved to be viable with remote work being done by disparate teams located in different continents. Computer Supported Cooperative Work is now a reality and this paper analyzes some of the concepts and use of modern information systems technology to fill the gap between a crowd of software developers and their day to day job questions when applied to a questions and answers site namely Stack Overflow. Each and every day millions of software developers outsource their questions to knowledgeable peer developers. Stack Overflow is the means that allows the knowledge transfer to happen and it came in a moment when a huge amount of data was being generated but it wasn’t well structured and organized for future reuse. Stack Overflow was born and is now a great option that offers a well-structured and exciting environment to ask and get answers online.

The full paper in PDF format is available at: https://drive.google.com/file/d/0B4SVxswDPXtwVzNFTWtSMkJoM3M/view?usp=sharing

Showcase StackOverflow flair/badge in LinkedIn profile

Posted by Leniel Maccaferri on 1/20/2011 01:27:00 PM

Today I just thought about showcasing/displaying my StackOverflow flair/badge in my LinkedIn profile. LinkedIn is a good place to show it because it’s a perfect match between your reputation as a software guy and possible good opportunities.

Since LinkedIn doesn’t allow one to put HTML code snippets in profile fields, this proved to be a difficult task at first. Nothing that a little bit more thinking couldn’t solve: Google presentation application to the rescue.

Well, basically what you have to do is create a simple online Google Docs presentation. You could also create an Office PowerPoint presentation and upload it. For sure it’ll give you a lot more customization options. I’ve chosen Google presentation because it suffices for my needs at the moment.

Inside that presentation you’ll put an image object that points to your StackOverflow flair image URL. Mine is this one:

http://stackoverflow.com/users/flair/114029.png?theme=dark

Just replace my StackOverflow user id (114029) in the URL above with yours to get your flair.

To change the theme, add ?theme=clean or ?theme=dark or ?theme=hotdog to the end of the image URL.
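
As a quick illustration of how the user id and theme fit into the address, here is a minimal sketch in JavaScript. The helper name flairUrl is made up for this example. It only assembles the URL pattern shown above with the three theme names listed in this post, and it makes no claim about what the server accepts beyond that.

```javascript
// Hypothetical helper (not an official API): builds a flair image URL
// from a StackOverflow user id and an optional theme name.
function flairUrl(userId, theme) {
    var url = "http://stackoverflow.com/users/flair/" + userId + ".png";
    // Theme names taken from the post; anything else is rejected here.
    var themes = ["clean", "dark", "hotdog"];
    if (theme !== undefined) {
        if (themes.indexOf(theme) === -1) {
            throw new Error("Unknown theme: " + theme);
        }
        url += "?theme=" + encodeURIComponent(theme);
    }
    return url;
}

console.log(flairUrl(114029, "dark"));
// prints "http://stackoverflow.com/users/flair/114029.png?theme=dark"
```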

This URL will render a picture that represents your flair this way:

The downside of this approach is that as your reputation grows your presentation won’t show your updated flair. You’ll have to open the presentation and re-add the flair image to reflect your current SO reputation.

Google should give us a way to have dynamic content inside a presentation (its spreadsheet service allows this kind of content). This is something that is really missing in presentations. Hey Google, can you hear me?

To give a more personal touch to your presentation you can also put a background image in your slide. I selected a Lamborghini Gallardo because it’s my main car in GT 5 today. So it represents my mood right now. One day I’ll have this car. Just kidding. :)

This is the final result:

This post provides an answer to this Meta StackOverflow question:

Is it possible to show SO flair on LinkedIn?

Using jQuery to disable/enable and check/uncheck Radio Buttons on Date selected

Posted by Leniel Maccaferri on 12/21/2010 12:22:00 PM

Motivated by this question on StackOverflow - Disable radio button depending on date, I decided to help and here I am with another code snippet post.

This time I show you how to use jQuery UI and its Datepicker control to control a set of radio buttons (only 2 in this post to make things easier) that have their state (enabled/disabled) changed depending on the date selected by the user.

Here’s the code:

     
<!DOCTYPE html>        
<html>        
<head>        
        
    <meta charset="UTF-8" />        
     
    <title>jQuery UI - Datepicker & Radio Buttons</title>        
        
    <!-- Linking to jQuery stylesheets and libraries -->        
    <link rel="stylesheet" href="http://ajax.googleapis.com/ajax/libs/jqueryui/1.8.7/themes/base/jquery-ui.css" type="text/css" media="all" />        
        
    <link rel="stylesheet" href="http://static.jquery.com/ui/css/demo-docs-theme/ui.theme.css" type="text/css" media="all" />        
    
    <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js" type="text/javascript"></script>        
        
    <script src="http://ajax.googleapis.com/ajax/libs/jqueryui/1.8.7/jquery-ui.min.js" type="text/javascript"></script>        
    
    <script src="http://jquery-ui.googlecode.com/svn/tags/latest/external/jquery.bgiframe-2.1.2.js" type="text/javascript"></script>        
        
    <script src="http://ajax.googleapis.com/ajax/libs/jqueryui/1.8.7/i18n/jquery-ui-i18n.min.js" type="text/javascript"></script>        
        
</head>        
        
<body>        
        
    <script>        
        
    $(function()        
    {         
        $("#datepicker").datepicker        
        ({           
            // Event raised everytime a date is selected in the datepicker        
            onSelect: function(date)        
            {         
                // Self explanatory :) - used to get today's date        
                var today = new Date();        
        
                // Business logic to change radio buttons' state
                if($("#datepicker").datepicker("getDate") > today)
                {         
                    $("#radioButton1").attr('disabled', true);        
                    $("#radioButton2").attr('disabled', false);  
                }         
                else        
                {         
                    $("#radioButton1").attr('disabled', false);        
                    $("#radioButton2").attr('disabled', true);  
                }         
            }         
        });         
        
        // Just setting the default localization for the datepicker        
        $.datepicker.setDefaults($.datepicker.regional['']);  
    });         
     
    </script>        
        
    <p>Date: <input id="datepicker" type="text"></p>        
        
    <input id="radioButton1" type="radio" value="myValue1" name="radioButton1"/>Radio button 1<br/>       
    <input id="radioButton2" type="radio" value="myValue2" name="radioButton2"/>Radio button 2       
        
</body>        
        
</html>

If you wanted to control the state (checked/unchecked) you’d have to make a small change in the code as follows:

     
// Business logic to change radio buttons' state
if($("#datepicker").datepicker("getDate") > today)
{  
    $("#radioButton1").attr('checked', true);  
    $("#radioButton2").attr('checked', false);   
}  
else  
{  
    $("#radioButton1").attr('checked', false);  
    $("#radioButton2").attr('checked', true);   
}

When you open the page for the first time you get this screen:

Figure 1 - Page when viewed for the first time (both radio buttons are enabled)

If you pick a date that is greater than today’s date, Radio button 1 is disabled (turns to gray) and Radio button 2 is enabled according to the logic implemented.

Figure 2 - Radio button 1 is disabled (turns to gray) and Radio button 2 is enabled

Otherwise, Radio button 2 is disabled and Radio button 1 is enabled:

Figure 3 - Radio button 2 is disabled and Radio button 1 is enabled
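
The date rule behind these three screens can also be isolated in a small function that runs without a browser or jQuery. This is just a sketch: the name radioStatesFor and the shape of the returned object are assumptions made for illustration. The comparison itself is the same one used in the onSelect handler above, where the selected date (at midnight) is compared with the current date and time.

```javascript
// Hypothetical helper (not part of the page above): returns which radio
// button should be disabled for a selected date, using the same rule as
// the onSelect handler (a date later than "now" disables radio button 1).
function radioStatesFor(selectedDate, now) {
    var isFuture = selectedDate.getTime() > now.getTime();
    return {
        radioButton1Disabled: isFuture,
        radioButton2Disabled: !isFuture
    };
}

// Fixed dates are used so the result does not depend on the current clock.
var now = new Date(2010, 11, 21, 12, 0, 0);   // 21 Dec 2010, noon
var tomorrow = new Date(2010, 11, 22);        // 22 Dec 2010, midnight
var states = radioStatesFor(tomorrow, now);
console.log(states.radioButton1Disabled, states.radioButton2Disabled);
// prints "true false"
```

Note that picking today's date yields midnight of today, which is not later than the current time, so it falls into the "otherwise" case.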

Hope you make good use of it.

StackOverflow: best place to share/learn programming

Posted by Leniel Maccaferri on 7/26/2010 08:26:00 PM

I’ve been spending some time of my days helping others at StackOverflow. StackOverflow (SO) is the best place to ask questions related to software programming. There you can be sure that someone somewhere will help you to find an answer to your question.

Note: I just left my job at Chemtech. Now I have more time to focus on other things and to think a little bit more about what I want to do next.

As a consequence I decided that I’ll try to give back and share a portion of the knowledge I acquired in these 7 years of programming experience. By helping others at StackOverflow I improve what I think I already know. This is a bit controversial you may say, but I don’t think so. I’m constantly learning/unlearning and discovering new things at SO. I keep trying to sharpen my programming skills. This happens in a somewhat recursive way. One finding leads to another, which then brings you back to the main topic, which then expands, and so forth.

Give, and it will be given to you. Good measure, pressed down, shaken together, running over, will be put into your lap. For with the measure you use it will be measured back to you.
Luke 6:38

I’m kind of a generalist (I’m after the generalist badge), that is, I don’t tie myself to any technology. I’ll try to use the one that best fits a given task/job, be it closed-source or open-source. If a technology allows me to get the thing done that’s the one I’ll choose. This is reflected in the variety of tags at my StackOverflow profile:

Of course there are well established technologies that are easier to work with as is the case of C# programming language, ASP.NET, Java, etc. Again this can be seen both on my tags and in the quantity of questions tagged with such technologies. Being easier to work with means having a greater user base throughout the world and this is reflected at StackOverflow tags as well.

The title of this post is what motivated me to write and I hope will motivate others too so that they give back a portion of what they know.

The point is this: whoever sows sparingly will also reap sparingly, and whoever sows bountifully will also reap bountifully. Each one must give as he has decided in his heart, not reluctantly or under compulsion, for God loves a cheerful giver. And God is able to make all grace abound to you, so that having all sufficiency in all things at all times, you may abound in every good work.
2 Corinthians 9:6-8

I then invite you: if you’d like to get free answers to your coding questions or if you want to write better code or if you want to start at SO and be happy or whatever related to coding problems, I suggest you read the FAQ and create an account at SO.

You see, we should help each other. There are times when we get stuck on some coding problem for hours and we don’t know what to do to get over it. Having millions of programming peers with the most varied backgrounds willing to help and share what they know is what makes a developer’s life valuable and exciting. At StackOverflow you find what summarizes this last phrase. In just a minute you can get an answer that’ll allow you to continue your work. If you compare minutes to hours it’s clear why StackOverflow is so fantastic.

The sum of our good programming efforts builds a better software product. LOL <(^^,)>
By Leniel Macaferi

As of the time of this writing, for stats purposes, I have 4257 reputation, 18 badges, 2 questions asked and 288 questions answered at SO. A pretty good amount of answers as a start for someone who’s also after the fanatic badge. I can assure you I’m more than halfway there already. Badges are not the point of this post but they make the whole thing a little bit more motivating. Badges at SO are recognitions that come with time just as in real life. The more you share the more recognition you earn.

I take my hat off to the people that started such a great site. I hope that it grows even more in the years to come with more and more users trying to learn this admirable profession of software developer that in my humble opinion is the best one. I’m biased towards it after all.

At SO we just happen to have pretty good discussions about anything related to programming including What’s your favorite “programmer” cartoon? and Is a master’s degree overkill? ranging to the more diverse topics. In one of those really interesting discussions I found this brilliant saying that I think holds true in this world:

When you finish your bachelor's, you think you know everything; When you finish your master's, you realize you know nothing; When you finish your doctorate, you realize nobody knows anything, including whether or not you needed to finish a doctorate to realize that.
By Unknown

Although I haven’t taken a master’s or doctorate course yet I can write about the bachelor’s part. When I finished the bachelor’s I already knew that I didn’t know nothing. I think that when I finish the master’s I’ll realize nobody knows anything. One less step to realize that. What will happen then if I take a doctorate course? Simply put: (no one knows nothing ) ^ Googol I think. :)

Final Note: I also put in practice this same share and learn principle at ProZ.com.

Further references
I suggest you check these two posts by Jon Skeet:

Writing the perfect question
Answering technical questions helpfully