Leniel Maccaferri's blog: masters degree


Processing Stack Overflow data dump with Apache Spark

Posted by Leniel Maccaferri on 7/04/2017 01:55:00 PM

This post is about the final work I did for one of the disciplines of the Master's degree I'm currently attending at UFRJ - Federal University of Rio de Janeiro in the branch of Data and Knowledge Engineering (Databases) that is under the division of Computer and Systems Engineering department at COPPE\UFRJ.

The discipline is called Special Topics in Databases IV and is taught by professor Alexandre Bento de Assis Lima.

The presentation (PPT slides) is in Brazilian Portuguese. I'll translate the slides to English in this blog post. They give an overall view about the work done.

The final paper is written in English.

Files

- Trabalho prático sobre Apache Spark envolvendo um problema típico de Big Data (apresentação\presentation).pdf (in Portuguese)

- Processing Stack Overflow data dump with Apache Spark (in English)

Abstract. This paper describes the process involved in building an ETL tool based on Apache Spark. It imports XML data from the Stack Overflow data dump.
The XML files are processed using the Spark XML library and converted to a DataFrame object. The DataFrame data is then queried with the Spark SQL library.
Two applications were developed: spark-backend and spark-frontend. The first one contains the code responsible for dealing with Spark, while the latter is user centric, allowing users to consume the data processed by Spark.

All the code developed is in English and should be easy to read.

Presentation

Objective
Problem
Technologies
Strategy used to acquire the data
Development
Conclusion
Links

Objective

Put into practice the concepts presented during the classes.
Have a closer contact with modern technologies used to process Big Data.
Automate the Extraction\Mining of valuable\interesting information hidden in the immensity of data.

Problem

Analyse StackOverflow data dump available on the internet on a monthly basis.
The data dump is composed of a set of XML files compressed in the .7z format.
Even after compression the biggest file is 15.3 GB. A file of this size is typical of the data volumes that Big Data tools are meant to handle.
Spark at first will be used as an ETL tool (ETL = Extract > Transform > Load) to prepare the data consumed by a front-end web app.
"At first" because there's also the possibility of using Spark as a tool to process the data that'll be shown in the web app.

Technologies

Apache Spark 2.0.1 +
Spark XML 0.4.1 +
Spark SQL 2.0.2
Ubuntu 16.04 LTS (Xenial Xerus)
Linux VM (virtual machine) running on Parallels Desktop 12 for Mac
Scala 2.11.8
XML (Extensible Markup Language)
XSL (Extensible Stylesheet Language)
Play Framework 2.5 (front end)
Eclipse Neon 4.6.1 with Scala IDE 4.5.0 plugin as the IDE

Strategy used to acquire the data

Got the .torrent file that contains all the data dumps from Stack Exchange family of sites - https://archive.org/details/stackexchange
Selected the eight .7z files related to StackOverflow: stackoverflow.com-Badges.7z, stackoverflow.com-Comments.7z, stackoverflow.com-PostHistory.7z, stackoverflow.com-PostLinks.7z, stackoverflow.com-Posts.7z, stackoverflow.com-Tags.7z, stackoverflow.com-Users.7z, stackoverflow.com-Votes.7z

Development

To make the work viable (running locally, outside of a cluster), a single .xml file [Users.xml] was used. This file has a total of 5,987,287 lines (1.8 GB), from which a subset of the first 100,000 lines (32.7 MB) was selected.

hadoop@ubuntu:/media/psf/FreeAgent GoFlex Drive/Downloads$ head -100000 Users.xml > Users100000.xml

The file Users.xsl was used to convert the Users100000.xml data to the format expected by the spark-xml library. The result was saved to Users100000.out.xml.
The .xml and .xsl files were placed into the input folder of the Scala project [spark-backend] inside Eclipse.
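
As a rough illustration of this kind of XSL step, the sketch below applies a small stylesheet with the JDK's built-in XSLT support. It assumes, purely for illustration, that the conversion turns each attribute of a row element into a child element; the actual Users.xsl may do something different, and the object name AttributesToElements and the user element name are hypothetical.

```scala
import java.io.{StringReader, StringWriter}
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.{StreamResult, StreamSource}

object AttributesToElements {
  // Hypothetical stylesheet: every attribute of <row> becomes a child element of <user>.
  val stylesheet: String =
    """<?xml version="1.0" encoding="UTF-8"?>
      |<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      |  <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
      |  <xsl:template match="/users">
      |    <users><xsl:apply-templates select="row"/></users>
      |  </xsl:template>
      |  <xsl:template match="row">
      |    <user>
      |      <xsl:for-each select="@*">
      |        <xsl:element name="{name()}"><xsl:value-of select="."/></xsl:element>
      |      </xsl:for-each>
      |    </user>
      |  </xsl:template>
      |</xsl:stylesheet>""".stripMargin

  // Applies the stylesheet to an XML string and returns the transformed XML.
  def transform(xml: String): String = {
    val transformer = TransformerFactory.newInstance()
      .newTransformer(new StreamSource(new StringReader(stylesheet)))
    val out = new StringWriter()
    transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out))
    out.toString
  }
}
```

For a file as large as the real dump, streaming the input from disk instead of holding it in a string would be the more realistic choice.
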
The application spark-backend reads the file Users100000.out.xml through Spark XML and transforms it into a DataFrame object.
The Spark SQL library is used subsequently to search the data. Some sample queries were created.
Each query generates a CSV file (SaveDfToCsv) to be consumed in a later stage by a web application [spark-frontend], that is, Spark is used as an ETL tool.
The result of each query is saved in multiple files in the folder output. This happens because Spark was conceived to execute jobs in a cluster (multiple nodes\computers).
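
The read, query, and save steps described above can be sketched roughly as follows. This is not the code from the spark-backend project: the object name, the rowTag value "user", the sample query, the column names, and the output path are all assumptions made for illustration. It needs Spark 2.x and the spark-xml package on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object ProcessUsersSketch {
  def main(args: Array[String]): Unit = {
    // Local master for a single machine, as in the setup described in the post.
    val spark = SparkSession.builder()
      .appName("spark-backend-sketch")
      .master("local[*]")
      .getOrCreate()

    // Assumption: after the XSL step each user is a <user> element whose
    // fields are child elements; adjust rowTag to the real element name.
    val users = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "user")
      .load("input/Users100000.out.xml")

    // Register the DataFrame so it can be queried with Spark SQL.
    users.createOrReplaceTempView("users")

    // Illustrative query; DisplayName and Reputation are assumed column names.
    val topUsers = spark.sql(
      "SELECT DisplayName, Reputation FROM users ORDER BY Reputation DESC LIMIT 10")

    // Spark writes a folder containing one part-* file per partition.
    topUsers.write
      .option("header", "true")
      .mode("overwrite")
      .csv("output/top-users")

    spark.stop()
  }
}
```
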
For testing purposes, a method that renames the CSV file was created. This method copies the generated CSV to a folder called csv. The destination folder can be configured in the file conf/spark-backend.properties.
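
A minimal sketch of such a copy-and-rename helper, using only java.nio, is shown below. The object and method names are hypothetical, and it assumes the query result was written as a single part file (for example after reducing the DataFrame to one partition); it is not the method from the project.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.collection.JavaConverters._

object CsvCollector {
  // Copies the first part-* file found in a Spark output folder to
  // targetDir/<name>.csv and returns the path of the copy.
  def copyPartFile(sparkOutputDir: Path, targetDir: Path, name: String): Path = {
    Files.createDirectories(targetDir)
    val stream = Files.newDirectoryStream(sparkOutputDir, "part-*")
    try {
      val part = stream.iterator().asScala.toList.headOption
        .getOrElse(throw new IllegalStateException(s"No part file found in $sparkOutputDir"))
      Files.copy(part, targetDir.resolve(s"$name.csv"), StandardCopyOption.REPLACE_EXISTING)
    } finally stream.close()
  }
}
```

The glob "part-*" skips Spark's marker and checksum files such as _SUCCESS and the hidden .crc files, since their names do not start with "part-".
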
The application [spark-backend] can be executed inside Eclipse or through the command line in Terminal using the command spark-submit.
In the command line we make use of the JAR file produced during the project build in Eclipse. We pass as parameters the necessary packages as below:

spark-submit --packages com.databricks:spark-xml_2.11:0.4.1 --class com.lenielmacaferi.spark.ProcessUsersXml --master local com.lenielmacaferi.spark-backend-0.0.1-SNAPSHOT.jar

The application [spark-frontend] was built with Play Framework (The High Velocity Web Framework For Java and Scala).
The user opens spark-frontend main page at localhost:9000 and has access to the list of CSV files generated by [spark-backend] application.
When clicking a file name, the CSV file is sent to the user's computer. The user can then use any spreadsheet software to open and post-process\analyse\massage the data.

Conclusion

With Spark's help we can develop interesting solutions, for example a daily job that downloads data and uploads it to an "input" folder, processing the data along the way in many different ways.
Using custom-made code we can work with the data in a cluster (fast processing) through a rich API full of methods and resources. In addition, we have at our disposal numerous additional libraries\plugins developed by the developer community, on top of all the power of Scala and Java and their accompanying libraries.
The application demonstrated can be easily executed in a cluster. We only need to change some parameters in the object SparkConf.
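
As a hedged illustration of the kind of change meant here, the sketch below builds a SparkConf whose master URL is passed in instead of hard-coded. The object name, the application name, and the example cluster URL are assumptions; the project's own configuration may look different.

```scala
import org.apache.spark.SparkConf

object ClusterConf {
  // Hypothetical helper: the caller decides where the job runs.
  //   "local[*]"               runs on a single machine using all cores
  //   "spark://some-host:7077" targets a standalone cluster (placeholder host)
  def build(master: String): SparkConf =
    new SparkConf()
      .setAppName("spark-backend")
      .setMaster(master)
}
```

Values set directly on SparkConf take precedence over spark-submit flags, so another option is to leave setMaster out of the code and pass --master on the command line.
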

Links

Source code @ GitHub - https://github.com/leniel/SparkEclipse
Apache Spark - https://spark.apache.org/
Spark XML - https://github.com/databricks/spark-xml
Spark SQL - http://spark.apache.org/sql/
Scala - https://www.scala-lang.org/
Ubuntu - https://www.ubuntu.com/
Eclipse - https://eclipse.org/

Using Zotero to convert Springer Link CSV search result to BibTex format

Posted by Leniel Maccaferri on 6/10/2017 10:37:00 AM

Today I needed to generate a BibTex file to serve as input to Parsif.al.

Parsif.al is an online tool designed to support researchers to perform systematic literature reviews within the context of Software Engineering.

I hit a brick wall while doing a search in Springer Link because it only gives us a CSV file with the entire search result, capped at the first 1000 records. It'd be a pain to click and open each and every search result to be able to export the corresponding BibTex.

Using Zotero it's easy to get a BibTex out of the CSV file generated by Springer Link.

Figure 1 - Zotero's Add item(s) by Identifier dialog

Follow these simple steps:

1 - Open the CSV file in Excel for example and copy the column that contains the item DOI [ Digital Object Identifier ];

2 - Paste the DOI(s) into Zotero's Add item(s) by identifier (see Figure 1 above). Wait while it imports...

3 - Select the folder where you imported the DOI(s); (Player Modeling in Figure 1)

4 - Right click the folder and select Export collection... pick BibTex.

You're done.

Hope it helps.

References:

Adding Items to your Zotero Library

Crowdsourcing & Stack Overflow: the power of the Crowd

Posted by Leniel Maccaferri on 2/01/2017 11:22:00 PM

It's been 2 years since I last posted. Wow! Time flies and life changes...

In 2016 I started a Master's degree course at UFRJ - Federal University of Rio de Janeiro in the branch of Data and Knowledge Engineering (Databases) that is under the division of Computer and Systems Engineering department at COPPE\UFRJ.

Some background history: I tried to join this same master's program back in 2010. I posted about it here: Masters degree application essay UFRJ 2011.1. At that time I was not accepted in the program because the teacher responsible for the Software Engineering branch, who interviewed me, told me that I didn't have a research profile; I was more of an industry guy. I was upset, of course, when my application got rejected, but I tried again in 2015, this time applying for both the Software Engineering branch and the Databases branch. I passed the tests for both branches. The same teacher interviewed me for Software Engineering after 5 years. She remembered me, said the same thing, and denied my application again. Thank God the Databases branch teachers were more receptive and accepted my enrollment. I love both areas, and Databases involves everything related to data, which attracts my attention as well.

Lesson: don't give up on your dreams... if you want it with all your heart, go for it! It can take some time but it'll happen.

Back to the post title... as part of the grade, most of the 8 necessary disciplines have a final paper to be developed. This post is about the paper I did for CSCW - Computer Supported Cooperative Work or even Computer Aided Collaborative Work.

I hope it can shed some light regarding the cooperative work put together by the Stack Overflow developer community with the help of our mainstream computers and mobile devices.

Abstract. Crowdsourcing gained attention in the past few years as a means to disseminate work to a crowd of people. People that can be scattered all over the world. The Internet removed all the barriers. Advances in Information Technologies and hardware allowed the development of tools that aid in the division of the work and tasks that need to be carried out. We are now in a stage where what once was supposed to be inviable is proved to be viable with remote work being done by disparate teams located in different continents. Computer Supported Cooperative Work is now a reality and this paper analyzes some of the concepts and use of modern information systems technology to fill the gap between a crowd of software developers and their day to day job questions when applied to a questions and answers site namely Stack Overflow. Each and every day millions of software developers outsource their questions to knowledgeable peer developers. Stack Overflow is the means that allows the knowledge transfer to happen and it came in a moment when a huge amount of data was being generated but it wasn’t well structured and organized for future reuse. Stack Overflow was born and is now a great option that offers a well-structured and exciting environment to ask and get answers online.

The full paper in PDF format is available at: https://drive.google.com/file/d/0B4SVxswDPXtwVzNFTWtSMkJoM3M/view?usp=sharing

Masters degree application essay UFRJ 2011.1

Posted by Leniel Maccaferri on 11/11/2010 06:23:00 PM

While applying to a masters degree course I had to write an essay of no more than 500 words containing:

1. Personal appreciation about the evolution of my academic and professional activities up to now, avoiding the mere repetition of information already contained in my resume/CV.
2. Succinct description about the reasons I have to take a post-graduation course and what I expect from the course.
3. My expectations regarding the post-graduate course's influences in my future professional activities.
4. Specification of topics of interest, trying to correlate them with the research area of the program I'm applying to.

So far so good.

I have sent my enrollment docs to two of the best universities in the Rio de Janeiro state in Brazil, namely: UFRJ and PUC-Rio. Both of them have the highest assessment grades ( 7 ) conferred by CAPES. CAPES is the Brazilian government agency responsible for the evaluation of post-graduate courses.

I applied for two research areas: Software Engineering and Artificial Intelligence.

The essay I'm posting here (English and Portuguese versions) is from the UFRJ enrollment process (Software Engineering one) in which I applied to the Systems and Computer Engineering Program (site in Portuguese).

Now I hope to be called by at least one of these institutions. The UFRJ process requires the applicant to take some tests: a language test (English) and a specific test (on the research area you're applying to), and both count towards acceptance.

Essay
I have been trying for almost three years now (since the graduation in Computer Engineering in Dec/2007) to apply at work and in my life the content I learned.

I succeeded in my last job at Chemtech because I had a good background provided by the Computer Engineering course. During the course, the disciplines that really caught my attention were those related to software. In this last job experience I could put into practice the theory seen at the university in my first level degree. I participated in several interesting software projects. I found this way the importance of theory in the activities of a computer professional.

My motivation to continue my studies through a masters course has existed since I finished my first level degree. Meanwhile, between graduation and the master's degree, I got a job. One of the reasons that made me quit that job was simply that I wanted to continue studying. I spent most of my time at university studying full time, which allowed me to perform well. Likewise, I wish to attend the masters full time.

I want to take the post-graduate course to:

1 - Learn and get a deep understanding of the software area;
2 - Improve what I know;
3 - Grow as a computer professional;
4 - Get better opportunities in the job market.

My expectations for the course are the best. I know the reputation of the course and the institution and I know I can learn a lot. This is my main goal: to learn more and with quality. I am hungry for knowledge!

After the masters and possessing more advanced knowledge, I intend to pursue a career in software. The software engineering market is "young" compared to others and has shown steady growth in recent years. I believe that this market will provide excellent opportunities for the professional who has a post-graduate degree in the area. Money Magazine and the Salary.com site rated Software Engineering as the best area to work in in 2006. This demonstrates and attests to this area's power and influence in the global market. When I mention a career, I think of software companies or even of educational institutions, as a teacher/university researcher. Whichever of these options I choose will satisfy me as a person and as a professional. If I decide to go with a software company, I hope to contribute a deeper view of the aspects involving software projects. Otherwise, I hope to be able to disseminate advanced knowledge in the area, teaching people. Brazil in particular has a deficit of engineers, particularly in the area of software engineering, which is a technology area that adds greater value and therefore contributes more effectively to the growth of our country.

My interest in the area of software engineering is focused on the development of software products. I love writing code, and consequently programming languages; solving programming problems; studying the software tools used in development and all the metrics involved; and any other subject related to software engineering.

See:

http://www.leniel.net
http://stackoverflow.com/users/114029/leniel-macaferi

Redação
Tenho buscado nestes quase 3 anos de formado em Engenharia de Computação aplicar no trabalho e na vida o conteúdo que aprendi.

Tive êxito em minha última experiência profissional devido à base proporcionada pelo curso de Engenharia de Computação. Durante o curso, as matérias que mais chamaram minha atenção foram aquelas ligadas a software. Nesta última experiência profissional pude colocar em prática a teoria vista na faculdade. Participei em vários projetos de software interessantes. Constatei dessa maneira a importância da teoria nas atividades do profissional de computação.

Minha motivação para dar continuidade nos estudos através de um curso de mestrado existe desde que terminei o curso de graduação. Neste meio tempo, entre a graduação e o mestrado, consegui um emprego. Um dos motivos que me fez sair desse emprego foi justamente o de querer continuar os estudos. Cursei a maior parte da graduação somente estudando, o que me permitiu ter um bom aproveitamento. Da mesma forma, desejo cursar o mestrado com dedicação exclusiva.

Quero fazer o curso de pós-graduação para:

1 - Aprender e me aprofundar mais na área de software;
2 - Aprimorar o que sei;
3 - Crescer como profissional da área;
4 - Conseguir melhores oportunidades no mercado de trabalho.

Minhas expectativas quanto ao curso são as melhores. Conheço a reputação do curso e da instituição e sei que poderei aprender bastante. Este é o meu principal objetivo: aprender mais e com qualidade. Sou faminto por conhecimento!

Após o mestrado e de posse do conhecimento mais avançado, pretendo seguir carreira na área de software. O mercado de engenharia de software é "jovem" se comparado a outros e tem apresentado constante expansão nos últimos anos. Creio que este mercado proporcionará excelentes oportunidades para o profissional que possui uma pós-graduação na área. A revista Money Magazine e o site Salary.com classificaram a área de Engenharia de Software como a melhor área para se trabalhar no ano de 2006. Isso mostra e atesta o poder e influência da área no mercado global. Quando menciono seguir carreira, penso em empresas de software ou até mesmo em instituições de ensino como professor/pesquisador universitário. Qualquer uma dessas opções que eu escolher me satisfará como pessoa e profissional. Caso eu siga em uma empresa de software, espero contribuir com uma visão mais ampla sobre os aspectos que envolvem os projetos de software. Caso contrário, espero ter a possibilidade de disseminar o conhecimento avançado na área, ajudando a formar pessoal capacitado. O Brasil em especial tem um déficit na área de engenharia e particularmente na engenharia de software, que é uma área tecnológica que agrega maior valor e portanto contribui de maneira mais eficaz para o crescimento do nosso país.

Meu interesse pela área de engenharia de software é centrado no desenvolvimento do produto de software. Amo escrever código e consequentemente as linguagens de programação, resolver problemas de programação, estudar ferramentas de software usadas no desenvolvimento e todas as métricas envolvidas e qualquer outro assunto possível que seja relativo à engenharia de software.

Veja:

http://www.leniel.net
http://stackoverflow.com/users/114029/leniel-macaferi