
Processing Stack Overflow data dump with Apache Spark

This post is about the final work I did for one of the disciplines of the Master's degree I'm currently attending at UFRJ (Federal University of Rio de Janeiro), in the Data and Knowledge Engineering (Databases) branch, which is part of the Computer and Systems Engineering department at COPPE\UFRJ.

The discipline is called Special Topics in Databases IV and is taught by Professor Alexandre Bento de Assis Lima.

The presentation (PPT slides) is in Brazilian Portuguese. I'll translate the slides to English in this blog post. They give an overall view of the work done.

The final paper is written in English.

Files

Trabalho prático sobre Apache Spark envolvendo um problema típico de Big Data (presentation).pdf (in Portuguese; "Practical work on Apache Spark involving a typical Big Data problem")

Processing Stack Overflow data dump with Apache Spark (in English)

Abstract. This paper describes the process involved in building an ETL tool based on Apache Spark. It imports XML data from the Stack Overflow data dump.
The XML files are processed with the Spark XML library and converted to a DataFrame object. The DataFrame data is then queried with the Spark SQL library.
Two applications were developed: spark-backend and spark-frontend. The first one contains the code responsible for dealing with Spark, while the latter is user centric, allowing users to consume the data processed by Spark.

All the code developed is in English and should be easy to read.

Presentation
  1. Objective
  2. Problem
  3. Technologies
  4. Strategy used to acquire the data
  5. Development
  6. Conclusion
  7. Links
  1. Objective
    • Put into practice the concepts presented during the classes.
    • Have closer contact with modern technologies used to process Big Data.
    • Automate the Extraction\Mining of valuable\interesting information hidden in the immensity of data.

  2. Problem
    • Analyse the StackOverflow data dump, which is made available on the internet on a monthly basis.
    • The data dump is composed of a set of XML files compressed in the .7z format.
    • Even compressed, the biggest file is 15.3 GB. This size is directly linked to the data volumes handled in Big Data.
    • Spark at first will be used as an ETL tool (ETL = Extract > Transform > Load) to prepare the data consumed by a front-end web app.
    • "At first" because there's also the possibility of using Spark as a tool to process the data that'll be shown in the web app.

  3. Technologies
    • Apache Spark 2.0.1 +
    • Spark XML 0.4.1 +
    • Spark SQL 2.0.2
    • Ubuntu 16.04 LTS (Xenial Xerus)
    • Linux VM (virtual machine) running on Parallels Desktop 12 for Mac
    • Scala 2.11.8
    • XML (Extensible Markup Language)
    • XSL (Extensible Stylesheet Language)
    • Play Framework 2.5 (front end)
    • Eclipse Neon 4.6.1 with Scala IDE 4.5.0 plugin as the IDE

  4. Strategy used to acquire the data
    • Got the .torrent file that contains all the data dumps from the Stack Exchange family of sites - https://archive.org/details/stackexchange
    • Selected the eight .7z files related to StackOverflow: stackoverflow.com-Badges.7z, stackoverflow.com-Comments.7z, stackoverflow.com-PostHistory.7z, stackoverflow.com-PostLinks.7z, stackoverflow.com-Posts.7z, stackoverflow.com-Tags.7z, stackoverflow.com-Users.7z, stackoverflow.com-Votes.7z

  5. Development
    • To make the work viable (running locally, outside a cluster), a single .xml file [Users.xml] was used. A subset of 100,000 lines (32.7 MB) was selected. The full file has a total of 5,987,287 lines (1.8 GB).
    • hadoop@ubuntu:/media/psf/FreeAgent GoFlex Drive/Downloads$ head -100000 Users.xml > Users100000.xml
    • The file Users.xsl was used to convert the Users100000.xml data to the format expected by the spark-xml library. The result was saved to Users100000.out.xml.


    • The .xml and .xsl files were placed into the input folder of the Scala project [spark-backend] inside Eclipse.
    • The application spark-backend reads the file Users100000.out.xml through Spark XML and transforms it into a DataFrame object.
    • The Spark SQL library is subsequently used to query the data. Some sample queries were created.
    • Each query generates a CSV file (SaveDfToCsv) to be consumed in a later stage by a web application [spark-frontend], that is, Spark is used as an ETL tool (a minimal sketch of this flow appears after this list).
    • The result of each query is saved as multiple files in the output folder. This happens because Spark was conceived to execute jobs in a cluster (multiple nodes\computers).
    • For testing purposes, a method that renames the CSV file was created. This method copies the generated CSV to a folder called csv. The destination folder can be configured in the file conf/spark-backend.properties.
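
For reference, here is a minimal sketch of that read-query-write flow. It is not the project's actual code: the row tag, the column names and the input/output paths are assumptions that may differ from what the XSL step really produces.

import org.apache.spark.sql.SparkSession

object ProcessUsersXml {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-backend")
      .master("local")
      .getOrCreate()

    // Read the pre-processed XML into a DataFrame via spark-xml.
    val users = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "user") // assumed row tag produced by the XSL step
      .load("input/Users100000.out.xml")

    users.createOrReplaceTempView("users")

    // One of the sample queries (column names assumed).
    val topUsers = spark.sql(
      "SELECT DisplayName, Reputation FROM users ORDER BY Reputation DESC LIMIT 10")

    // Save the result as CSV; the project wraps this step in a SaveDfToCsv helper.
    // coalesce(1) yields a single part file, which makes the rename/copy step simpler.
    topUsers.coalesce(1)
      .write.format("csv")
      .option("header", "true")
      .save("output/top-users")

    spark.stop()
  }
}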



    • The application [spark-backend] can be executed inside Eclipse or from the command line in Terminal using the spark-submit command.
    • On the command line we make use of the JAR file produced during the project build in Eclipse, passing the necessary packages as parameters, as below:
    • spark-submit --packages com.databricks:spark-xml_2.11:0.4.1 --class com.lenielmacaferi.spark.ProcessUsersXml --master local com.lenielmacaferi.spark-backend-0.0.1-SNAPSHOT.jar
    • The application [spark-frontend] was built with Play Framework (The High Velocity Web Framework For Java and Scala).
    • The user opens spark-frontend's main page at localhost:9000 and gets access to the list of CSV files generated by the [spark-backend] application (a hypothetical controller sketch follows below).
    • When clicking a file name, the CSV file is sent to the user's computer. The user can then use any spreadsheet software to open and post-process\analyse\massage the data.
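
A Play controller with that listing/download behavior could look roughly like the sketch below. This is a guess at the shape of the code, not the actual spark-frontend source; the controller name, the csv folder location and the plain-text listing are assumptions.

package controllers

import java.io.File
import play.api.mvc._

class CsvController extends Controller {

  // Folder populated by spark-backend's rename/copy step (assumed location).
  private val csvDir = new File("csv")

  // Lists the CSV files generated by spark-backend.
  def index = Action {
    val names = Option(csvDir.listFiles).getOrElse(Array.empty[File])
      .filter(_.getName.endsWith(".csv"))
      .map(_.getName)
      .sorted
    Ok(names.mkString("\n")) // the real app would render an HTML template here
  }

  // Sends the chosen CSV file to the user's browser as a download.
  def download(name: String) = Action {
    val file = new File(csvDir, name)
    if (file.isFile) Ok.sendFile(file, inline = false) else NotFound(name)
  }
}

Play would also need matching entries in conf/routes for index and download; those are omitted here.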


  6. Conclusion
    • With Spark's help we can develop interesting solutions, such as a daily job that downloads data, uploads it to an "input" folder, and processes it along the way in many different ways.
    • Using custom-made code we can work with the data in a cluster (fast processing) using a rich API full of methods and resources. In addition, we have at our disposal numerous additional libraries\plugins developed by the developer community. Add to this all the power of Scala and Java and their accompanying libraries.
    • The application demonstrated can easily be executed in a cluster. We only need to change some parameters in the SparkConf object, as sketched below.
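
For instance, something along these lines; the master URL and memory value are illustrative assumptions, not the project's actual settings:

import org.apache.spark.SparkConf

// Hypothetical cluster configuration; values are placeholders.
val conf = new SparkConf()
  .setAppName("spark-backend")
  .setMaster("spark://master-host:7077") // "local" when running on a single machine
  .set("spark.executor.memory", "4g")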

  7. Links

Validating Date with MVC Foolproof Validation on the client side

This is a quick tip for something that I just tried with an ASP.NET MVC app and that worked as expected.

I use MVC Foolproof Validation to complement the standard validators that come with ASP.NET MVC in the System.ComponentModel.DataAnnotations assembly. Foolproof has some awesome validators like:

[Is]
[EqualTo]
[NotEqualTo]
[GreaterThan]
[LessThan]
[GreaterThanOrEqualTo]
[LessThanOrEqualTo]
[RequiredIf]
[RequiredIfNot]
[RequiredIfTrue]
[RequiredIfFalse]
[RequiredIfEmpty]
[RequiredIfNotEmpty]
[RequiredIfRegExMatch]
[RequiredIfNotRegExMatch]

You can get it through a NuGet package here.

Let’s say you have a form with a DateTime field called “Show Until” that is part of a Notice object. This property stores a date value that should obviously be in the future; it makes no sense to set a date in the past for this field. So the question is: how can you get client side validation for this field without a round trip to the server to compare the date the user entered with today’s date?

This is what this post will show you…

This is the model property decorated with the respective data annotation GreaterThanOrEqualTo:

[Required(ErrorMessageResourceName = "Required", ErrorMessageResourceType = typeof(Localization))]
[Display(Name = "ShowUntil", ResourceType = typeof(Localization))]
[GreaterThanOrEqualTo("Today", ErrorMessageResourceName = "GreaterThanOrEqualTo", ErrorMessageResourceType = typeof(Localization))]
public DateTime ShowUntil { get; set; }

Pay attention to that "Today" string value. It tells Foolproof’s validator to compare the ShowUntil property with a property called Today (the dependent property). Wait a moment: the Notice object doesn’t have a property called Today, since it was never necessary. So what’s the point?

Now let’s add a partial class to extend the Notice class with a property called Today. This property returns today’s date, of course.

public partial class Notice
{
    public DateTime Today
    {
        get { return DateTime.Today; }
    }
}

Run the app now, enter a date in the past for the ShowUntil field and submit the form. You’ll see that a postback is made to the server, and only after that is the validation error message shown to the user. Foolproof’s validator doesn’t know the value of Today yet, since it’s not part of the form data being edited. Remember: this kind of server side validation is not what we want. We want client side validation. We don’t want this data round trip to the server just to validate one field in the form. So, how to solve this? It’s pretty simple…

Inside your form view (a .cshtml file, since I’m using C# and the Razor view engine), add this hidden field:

@Html.HiddenFor(model => model.Today)

Now reload the page and try entering a date in the past, like yesterday’s date. You’ll see that the user gets the validation error message instantly. Why? Because now Foolproof’s jQuery validator that is added to the page has Today’s date value present in the form’s HTML output (that’s the purpose of the hidden field). If you look at the form/page source code in the browser (right click the browser window and select “View Page Source” in Firefox), you’ll see this value:

<input data-val="true" id="Today" name="Today" type="hidden" value="29/06/2012 00:00:00" />

That’s it. No date in the past is allowed when adding a new Notice to the system.

This use case can be considered a good use for hidden fields. They serve a multitude of purposes…

Hope it helps.

Updated at 12:45 PM

You see how things are... after posting this, I discovered by chance a really straightforward way of getting what I wanted. I wasn’t even looking for this anymore when I hit this StackOverflow question. Looks like God took me there… I happen to be using the jQuery UI Datepicker control too. Among its vast array of options, it has a minDate property. So if you do this:

<script type="text/javascript">

    $(document).ready(function ()
    {
        $("#ShowUntil").datepicker('option', 'minDate', new Date());
    });

</script>

new Date() is today’s date in JavaScript, and so the datepicker won’t allow the selection of dates in the past. One word: AWESOME. A one-liner with jQuery’s help!

Apple’s magic mouse & wireless keyboard battery life

I’ve been collecting data regarding the battery life cycle of my Apple magic mouse and wireless keyboard. I thought it’d be good to post this data here so that you, dear reader, can draw your own conclusions about how long the battery life should last.

Apple Magic Mouse ($ 69 or R$ 229 today)
Apple Wireless Keyboard ($ 69 or R$ 229 today)

This is interesting data to look at because I like statistics and it’s good to analyze data; it helps you predict what to expect… Obviously the batteries’ lifespan changes according to how much time you spend using these devices. In my case I tend to use the computer 8 hours/day on average.

Just now I decided to search for a neat app (Finch for Mac) so that I can track the time I spend working on the computer. This will for sure give me more accurate data in the near future. For now I’ll have to take this data for granted…

Take a look at the following Google spreadsheet:

As you see, the keyboard has great autonomy. There were periods in which I didn’t work as much as I’d like on the computer, when the mouse charge used to last 3 weeks; this denotes periods of inactivity. Now that I’m doing more intensive work, the mouse battery lasts only 1.5 weeks. This is half the time previously taken to drain the battery.

For the record: I also own an Apple battery charger. Recharging the batteries is an easy task. Once more I’m contributing to the environment.

Apple Battery Charger ($ 29 today)

All these Apple devices are great IMHO, as is the case with everything made by Apple!

The only “complaint” is that I think the mouse should have a greater autonomy. Maybe this will be the case with more recent technology iterations.

Tracking Last.fm profile visits with Google Analytics

I'm a big fan of analytics data; therefore I try to collect stats about anything – like visitors that check my web pages/profiles. It’s no different with my Last.fm profile.

The Last.fm profile page uses BBCode markup, a lightweight markup language used to format posts in many message boards, and unfortunately it has no JavaScript support.
Google Analytics (GA), on the other hand, makes use of JavaScript for the tracking code one must include in their website. This causes a frustrating experience to anyone that tries to integrate the GA tracking code in places where JavaScript code isn’t accepted.

While searching for a solution some time ago I didn’t find a way to solve this impedance, but yesterday, after putting a little bit more care into my Google search query, I finally found a way of filling the gap so that both services could talk to each other: NoJSStats, short for No JavaScript Stats, is the missing piece. It’s a free web service. Basically, NoJSStats uses Google App Engine to track visitors using just an image or anything that allows external resource requests in the website you want to track.

Here’s what you need to do in order to get analytics data popping up at your Google Analytics account in respect to your Last.fm profile:

First, let’s configure GA side…

1 - Go to Google Analytics and click the Gear icon at the top right of any page and create a new Web Property:

Figure 1 - Creating a new Last.fm Google Analytics Web Property

2 - Name the Web Property and in the Web Site URL field enter your Last.fm profile URL and click Create property:

Figure 2 - Filling Last.fm Web Property with the correct data

3 - Take note of the Web Property ID assigned to your new Web Property:

Figure 3 - Checking Web Property ID that’ll be used when configuring Last.fm profile

Now let’s configure Last.fm side:

4 - Edit your Last.fm profile by clicking the Edit link located on the right side of your profile page:

Figure 4 - Edit button used to get access to Last.fm profile data

5 - Head to the “About You” field and add this line at the end:

[img]http://nojsstats.appspot.com/UA-1234567-89/www.last.fm/user/leniel[/img]

Do not forget to click the Save details button.

Two things related to that img URL deserve a note here:

You’re using the Last.fm GA Web Property ID, for example UA-1234567-89, taken from step 3 above.
You’re pointing to your Last.fm profile in the last part of the URL.

DO CHANGE these two parts accordingly.

Interestingly enough, if you copy & paste the URL from step 5 into your browser, you get this result:

http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=37172618&utmsr=-&utmsc=-&utmul=-&utmje=0&utmfl=-&utmdt=-&utmhn=www.last.fm/user/leniel&utmr=&utmp=&utmac=UA-1234567-89&utmcc=__utma%3D167996071.415430387.1319322154.1319322154.1319322154.1%3B%2B__utmb%3D167996071%3B%2B__utmc%3D167996071%3B%2B__utmz%3D167996071.1319322154.2.2.utmccn%3D%28direct%29%7Cutmcsr%3D%28direct%29%7Cutmcmd%3D%28none%29%3B%2B__utmv%3D167996071.189.24.47.122%3B

Awesome! NoJSStats does its magic and all the GA data, every single item, is assembled and packed into the request URL's query string (everything after the '?'). For more technical details, see this StackOverflow question: Why does Google Analytics use __utm.gif?

This is everything you need to do. Easy, no?

If you have problems
I have tested this workflow and it’s indeed working as expected. In case you can’t get it working:

If you check the Tracking Code status in GA it says: "Tracking Not Installed".

Figure 5 - Last.fm Tracking Code status

It doesn't matter... really. That's because we aren't using the standard JavaScript code provided by GA, which is what GA waits to detect. Since we are not using that approach, it's OK if it reports "Tracking Not Installed".

Analytics data appears after a while; it's not real-time at the moment.

If you still do not see any data after a day or two, please leave a comment in this post and I’ll try to help.

Final notes
You should also see data on GA about your own visits to your profile! You can block this from being counted. Check this: How do I exclude my internal traffic from reports?

One of the advantages of using this approach instead of others, like the Flag Counter I usually see all over Last.fm profiles, is that you keep your data private and get way more stats about your visitors.

Last.fm is missing visitor Analytics features in its current incarnation. I sure hope it implements something interesting in this area.

For now, I hope you get a better insight into how your profile is performing in the internetz.

SQL UPDATE statement with SELECT and SQL Server Image data type column

In a SQL Server database I have a table called users which has a column named signature. This column is of type Image.

My production SQL Server is located in a shared hosting environment.

One problem I’ve been facing lately is that I lack permission to execute the UPDATE statement that inserts a signature image for a given user. The problem occurs because the update has to go through a BULK statement like this one, for example:

--Update existing user
UPDATE users
SET [signature] = (SELECT MyImage.* FROM OPENROWSET(
    BULK 'C:\MyProject\trunk\MyCompany.Product\MyCompany.Product.Web\Images\Signature.jpg',
    SINGLE_BLOB) MyImage)
WHERE Id = '1111aaaa-1111-11aa-a111-111111a1a1a1'

The query above works fine on my local machine, but when I tried to execute it on the remote/production server, I got this beautiful message: "You do not have permission to use the bulk load statement."

In the shared hosting environment the execution of the Bulk command is disabled by default for security reasons. This is annoying but totally understandable!

Bulk load allows the user to populate a database from a file. It’s not available in a shared environment because the client's files would have to be placed locally on the SQL Server machine (the production server).

So… great! I need a way to bypass this limitation, because I won’t spend piles of money paying for a dedicated server… that would make sense if and only if I really needed a dedicated server.

1st try: import a specific user row from my local SQL Server to the remote instance using the SQL Server Management Studio Import task. I got an error about a constraint key violation because I already had the same row (for that user) in both databases; it’s just a matter of updating the signature column in the production database. This path seemed to be a pain.

2nd try: consider a dedicated server? No thanks… hehehe

3rd try and solution: a few days later I found myself thinking about this problem again (this signature column update is a recurring task), so I decided to find another way, and it came to light: link the remote server to my local SQL Server Express instance and write a beautiful SQL query that does the job.

First I stopped at this excellent blog post with a step by step guide written by jen: Create Linked Server SQL Server 2008. This post provided everything I needed to link both SQL Server instances.

To make sure you have linked your server correctly, you can execute this query in your local server:

select server_id, name, product, provider, data_source, [catalog], is_linked
from sys.servers

The above query gave me this result:

Figure 1 - Linked servers linked to my local SQL Server Express instance

Then I Googled for "Update with Select + SQL Server" or something like that and found this StackOverflow question: Updating a table with multiple values from a select statement where the date matches. Lieven’s answer helped; I just had to adapt it to my case. This is the SQL code that does the dirty work:

-- U is the remote table reached through the linked server; users is the local table
UPDATE  U
SET     U.[signature] = users.[signature]
FROM    [LOCAWEB].[laudotech].[dbo].[users] U INNER JOIN users
ON      users.id = U.id AND U.id = '1234aaaa-5678-90aa-b123-456789a0a1a2'

The above query must be executed within the context of the local SQL Server instance, of course, since that's where the linked server resides.

To give you a view… this is how all this is configured inside SQL Server Management Studio (SSMS):

Figure 2 - SSMS Object Explorer and the Linked Server LOCAWEB in my local SQL Server Express instance

There are so many things one can do with SSMS that I feel really happy learning one more of them. Last week I blogged about Import/Export SQL Server database data with SSMS. Take a look at it.

Man! Have I said that I like working with databases!?

Hope it helps.

Import/Export SQL Server database data with SSMS

This post is a visual guide about one of the most powerful features of Microsoft SQL Server Management Studio (SSMS) 2008: Import/Export database data.

These tasks are useful, for example, when you need to debug an application with real world/production data to find possible bugs in your code. You can import the production database data to your local development database and find what’s causing problems. These tasks also create a protection layer, because you won’t be working “directly” with production data.

As you see, if I right click a database in SSMS I get access to a rich menu full of interesting options that deserve careful study:

SSMS 2008 Import Export Database Tasks

Looking at the above picture you see in Object Explorer that I’m connected to two SQL database servers. The first one is a remote SQL Server (version 10.0.2775) and the other one is my local SQL Server Express (version 10.50.1617) instance.

With this configuration I have the perfect scenario to visually show you how the Import/Export Data… menu options work.

In this blog post I’ll execute the Export Data task. This allows me to test my web application using my local SQL Server instance with real world/production data.
The Import Data… task does practically the same thing but in reverse, logically. I’ll leave the execution of the Import task to you, dear reader!

OK… If I proceed and select the Export Data… menu option, I’ll be able to bring new data to the database located on my machine. The data comes from the remote/production server to my local copy of the same database.

After clicking Export Data…, the SQL Server Import and Export Wizard pops up. I then select the Data source/Server name (the data source type and server from which I want to get the data to be imported). In this case, the data comes from the remote SQL Server. I use SQL Server Authentication, fill in the user name and password and select the database that’ll provide the data. In the Server name field I used the IP address of my remote server.

SQL Server Import and Export Wizard - Choose a Data Source

Now it's time to choose the Destination. In this case it’s my local SQL Server instance. LENIEL-PC is my Windows 7 machine name; SQLEXPRESS is the default SQL Express server name. I also selected the database I want to receive the imported data. As you see, both databases (source and destination) have the same name, to keep things easy.

SQL Server Import and Export Wizard - Choose a Destination

The wizard then asks if I want to copy data directly from table to table or if I want to write a custom SQL query to shape the data according to specific needs. I’ll select the first option to keep things simple.

SQL Server Import and Export Wizard - Specify Table Copy or Query

In the next screen we get a beautiful list of available tables and views. I’ve selected two tables that I want to copy.

SQL Server Import and Export Wizard - Select Source Table and Views

By clicking the Edit Mappings… button (which applies to each table selected in the grid), I can also specify whether I want to delete rows in the destination table before importing data or append the imported rows to it. I’ll append rows because I have already erased both tables (exams and files) in my local database.

Table files is linked to table exams through a foreign key. I disabled this foreign key constraint in my local database to avoid conflicts/errors during the import process. Beware, because conflicts do appear most of the time; they are generally related to key violation/duplication.

SQL Server Import and Export Wizard - Column Mappings

As I’m using the free SQL Server Express Edition, the following screen shows an interesting piece of information:

In SQL Server Express, Web, or Workgroup, you can run the package that the Import and Export Wizard creates, but cannot save it. To save packages that the wizard creates, you must upgrade to SQL Server Standard, Enterprise, Developer or Evaluation.

This is telling me that I’ll have to go through the wizard (7 steps) again if I happen to need this very import process once more. Really annoying… but that’s the price you must pay when using free tools.

SQL Server Import and Export Wizard - Run Package

Next button once more (the last one, hallelujah) and we’re ready to go. It’s just a matter of reviewing what’ll be done and pressing the Finish button.

SQL Server Import and Export Wizard - Complete the Wizard

After clicking Finish I got this last window:

SQL Server Import and Export Wizard - Execution Successful

Hope this simple and visual guide helps you understand how the data Import and Export processes work when dealing with a SQL Server database inside SSMS.

Free tools
Microsoft SQL Server Express Edition
Microsoft SQL Server 2008 Management Studio Express (SSMSE)

Oracle NLS_LANG for language/territory/character set

Recently I had an Oracle problem at work.
I kept postponing a careful look at this problem till today.

Problem
All accented characters such as à, á, é, í, ó, ú, â, ê, î, ô, û, ã, õ, etc. and other ones such as ç (cedilla) and º, ª (ordinal indicators) were being replaced by erroneous characters, as in this example from the header of a stored procedure:

=================================================================
-- Revis¿o : 'N¡ã 0
-- Descri¿¿o: // description here
-- Data   : 03/29/2010
-- Autor  : CHEMTECH
=================================================================

This problem gets worse if you have something like

SELECT CONTRACT.NUM_CONTRACT,
           'N¡ã' || TO_CHAR(CONTRACT.NUM_CONTRACT) || ' - ' ||

The result of the above SQL query is used in the UI, that is, the user will see the wrong characters ¡ã instead of º. And if you commit this thing to the repository, you mess everything up. You see what I mean…

Questioning
At first I thought the problem was the database encoding settings.

I and only I was getting crazy characters. My development group has 3 more people and no one else was having this problem. This clearly should be a UI configuration, that is, how the data is shown on my machine.

A little bit of research on the net took me to Oracle’s NLS_LANG FAQ.

This is what NLS_LANG stands for/does:

A locale is a set of information addressing linguistic and cultural requirements that corresponds to a given language and country. Traditionally, the data associated with a locale provides support for formatting and parsing of dates, times, numbers, and currencies, etc. Providing current and correct locale data has historically been the responsibility of each platform owner or vendor, leading to inconsistencies and errors in locale data.

Setting the NLS_LANG environment parameter is the simplest way to specify locale behavior for Oracle software. It sets the language and territory used by the client application and the database server. It also indicates the client's character set, which corresponds to the character set for data to be entered or displayed by a client program.

I started reading the FAQ (top-down approach) and tried the following query:

SELECT * FROM NLS_SESSION_PARAMETERS;

which gave me this result:

Figure 1 - NLS_SESSION_PARAMETERS on my development machine

As you see NLS_LANGUAGE and NLS_TERRITORY are set to AMERICAN and AMERICA respectively on my machine.

I asked Thiago Arakaki, a coworker of mine, to execute the same SQL query on his dev box. To my surprise and delight, this was the screen he got:

Figure 2 - NLS_SESSION_PARAMETERS on Thiago Arakaki’s development machine

As you see NLS_LANGUAGE and NLS_TERRITORY are set to BRAZILIAN PORTUGUESE and BRAZIL respectively on his machine.

This clearly pointed to what could be the cause of the problem I was having.

I couldn’t understand why my machine had a different NLS_LANGUAGE and NLS_TERRITORY.

I read the FAQ a little bit more and got to the part Where to set the NLS_LANG in Windows.

I checked the Windows registry to make sure the NLS_LANG subkey was set correctly under the Oracle home. I got this:

Figure 3 - NLS_LANG subkey on Windows Registry

To my surprise it was correct. Look at NLS_LANG: it has BRAZILIAN PORTUGUESE_BRAZIL.WE8MSWIN1252, just like Thiago Arakaki’s dev box.

So why was I still getting wrong characters inside Allround Automations PL/SQL Developer when working on the same database that Thiago also uses? Aha, that’s the question this post tries to clarify, in case you find yourself in this same situation.

Reading the FAQ a little bit more… :o) I saw that you could also set NLS_LANG as a System or User environment variable, in System properties. So I went there to check if I already had such a thing set on my machine. Again, to my surprise, it was there. I don’t remember if/when I created this variable.

The only thing I could think of at the moment was interference from some previous Oracle installations on my machine. Today I’m using Oracle Express, but in the past I installed the full Oracle server. I can’t state for sure what created the NLS_LANG variable among my Windows System variables.

The screen I got was something like this:

Figure 4 - NLS_LANG variable on Windows System variables

I had AMERICAN set for the NLS_LANG variable.

Solution
I then changed the system variable NLS_LANG to BRAZILIAN PORTUGUESE_BRAZIL.WE8MSWIN1252, clicked OK, closed PL/SQL Developer, opened it again and issued the SQL query described at the beginning of this post. I could see that the NLS_LANGUAGE and NLS_TERRITORY rows had changed appropriately to BRAZILIAN PORTUGUESE and BRAZIL respectively. After that I started to get the right characters throughout the database.

Great! Problem solved.