
Java web crawler searcher robot that sends e-mail

This Java crawler is extremely useful if you need to search a web page for a specific word, tag, or anything else you want to find in the data retrieved from a given URL.

I’ve used it, for example, to search for a specific error message that appeared on a page when a connection to the database could not be established. It helped me prove that the error was really caused by the failure of the connection to the database.

The crawler saves to the file system the page that contains the string you’re searching for. The file name contains the time at which the string was found in the page body. With this information I could match the time in the file name with the time of the error recorded in the web server log.
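As a rough illustration of the file-naming idea, here is a minimal sketch. The class name SnapshotNames, its method names and the yyyy-MM-dd_HH-mm-ss timestamp pattern are assumptions made for this example. The crawler's own code, shown further down, sanitizes the default date string instead. Only the character substitution is taken from the original code.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper, not part of the original crawler.
final class SnapshotNames
{
    // Replaces every character that is not a letter or digit with '_',
    // the same substitution the crawler applies to URLs and timestamps.
    static String sanitize(String text)
    {
        return text.replaceAll("[^A-Za-z0-9]", "_");
    }

    // Builds "<sanitized url>/<timestamp>.html", relative to an output directory.
    // The timestamp pattern is an assumption chosen so that names sort by time.
    static String snapshotPath(String url, Date foundAt, TimeZone zone)
    {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss");
        format.setTimeZone(zone);

        return sanitize(url) + "/" + format.format(foundAt) + ".html";
    }

    public static void main(String[] args)
    {
        TimeZone zone = TimeZone.getTimeZone("America/Sao_Paulo");

        System.out.println(snapshotPath("http://www.example.com/status", new Date(), zone));
    }
}
```

A fixed pattern like this one is easier to compare with the timestamps in a server log than the default date string, but any pattern that suits your log format would do.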

The code was originally developed by Rodrigo Gama, a fellow developer and coworker of mine. I just adapted it a little to fit my needs.

What’s the idea behind the crawler?
The main idea behind the crawler is the following:

You pass two essential parameters to run the application: the string you want to search for and the URLs you want to verify.

A thread is then created for each URL, using the PageVerificationThread class, which implements Runnable.

The first PageVerificationThread creates a notificator object that all threads share. It is responsible for calling MailSender, which in turn sends a notification message to the e-mail addresses you hardcoded in the Main class.

The message is also hardcoded inside the run() method of the PageVerificationThread class.

I advise you to read the comments in the code.

You’ll have to change some strings in the code, such as the username and password used to send the e-mails, the proxy host and port, and the output directory.
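The core of each verification is simply a line-by-line search of the page body. A minimal sketch of that step, isolated from the networking and e-mail parts, could look like the following. The class TokenSearchSketch and its method are hypothetical names used only here. Unlike the real thread shown below, this sketch stops at the first match instead of collecting the whole page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the original crawler.
final class TokenSearchSketch
{
    // Returns true if any line read from the source contains the token.
    // The reader is closed when the search ends.
    static boolean containsToken(Reader source, String token)
    {
        try (BufferedReader reader = new BufferedReader(source))
        {
            String line;

            while ((line = reader.readLine()) != null)
            {
                if (line.contains(token))
                {
                    return true;
                }
            }

            return false;
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        // An in-memory page stands in for the response of a real URL.
        String page = "<html>\n<body>The service is temporarily unavailable.</body>\n</html>";

        System.out.println(containsToken(new StringReader(page), "is temporarily unavailable."));
    }
}
```

Because the search works line by line, a string that spans a line break in the page source will not be found. The same holds for the crawler itself.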

The Code
The crawler has 4 classes: MailSender.java, Main.java, Notificator.java and PageVerificationThread.java.

This is the Main class:

/**
 * @authors Rodrigo Gama (main developer)
 *          Leniel Macaferi (minor modifications and additions)
 * @year 2009
 */

import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.List;

public class Main
{
    public static void main(String[] args) throws MalformedURLException
    {
        // URLs that will be verified
        List<String> urls = new ArrayList<String>();

        // Emails that will receive the notification
        String[] emails = new String[]
        { "youremail@gmail.com" };

        // Checking for arguments
        if(args == null || args.length < 2 || args[0] == null)
        {
            System.out.println("Usage: <crawler.jar> <search string> <URLs>");

            System.exit(0);
        }
        else
        {
            // Printing some messages to the screen
            System.out.println("Searching for " + args[0] + "...");
            System.out.println("On:");

            // Showing the URLs that will be verified and adding them to the urls list
            for(int i = 1; i < args.length; i++)
            {
                System.out.println(args[i]);

                urls.add(args[i]);
            }
        }

        // For each URL we create a PageVerificationThread passing to it the URL address, the token to
        // search for and the destination emails.
        for(int i = 0; i < urls.size(); i++)
        {
            Thread t = new Thread(new PageVerificationThread(urls.get(i), args[0], emails));

            t.start();
        }
    }
}

This is the PageVerificationThread class:

/**
 * @authors Rodrigo Gama (main developer)
 *          Leniel Macaferi (minor modifications and additions)
 * @year 2009
 */

import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Calendar;
import java.util.Properties;
import java.util.TimeZone;

public class PageVerificationThread implements Runnable
{
    private String                strUrl;
    private String                searchFor;
    private static Notificator    notificator = null;
    private static Object         lock        = new Object();
    private int                   numAttempts = 0;

    public PageVerificationThread(String strUrl, String searchFor, String[] emails)
    {
        this.strUrl = strUrl;
        this.searchFor = searchFor;

        synchronized(lock)
        {
            if(notificator == null)
            {
                notificator = new Notificator();

                // For each email, adds it to the notificator "to" list.
                for(int i = 0; i < emails.length; i++)
                {
                    notificator.addDesetination(emails[i]);
                }
            }
        }
    }

    public void run()
    {
        try
        {
            URL url = new URL(strUrl);

            // Time interval to rerun the thread
            float numMinutes = 1;

            while(true)
            {
                try
                {
                    Properties systemProperties = System.getProperties();
                    systemProperties.put("http.proxyHost",
                            "proxy.yourdomain.com");
                    systemProperties.put("http.proxyPort", "3131");
                    System.setProperties(systemProperties);

                    URLConnection conn = url.openConnection();
                    conn.setDoOutput(true);

                    // Get the response content
                    BufferedReader rd = new BufferedReader(
                            new InputStreamReader(conn.getInputStream()));
                   
                    String line;
                   
                    StringBuilder document = new StringBuilder();

                    // A calendar to configure the time
                    Calendar calendar = Calendar.getInstance();
                    TimeZone tz = TimeZone.getTimeZone("America/Sao_Paulo");
                    calendar.setTimeZone(tz);
                    calendar.add(Calendar.SECOND, 9);
                    String timeStamp = calendar.getTime().toString();

                    boolean error = false;

                    // For each line contained in the response
                    while((line = rd.readLine()) != null)
                    {
                        document.append(line + "\n");

                        // If the line contains the text we're after...
                        if(line.contains(searchFor))
                        {
                            error = true;
                        }
                    }

                    // Releases the connection's input stream
                    rd.close();

                    // System.out.println(document.toString());
                   
                    // If we found the token...
                    if(error)
                    {
                        // Prints a message to the console
                        System.out.println("Found " + searchFor + " on " + strUrl);

                        // Sends the e-mail
                        notificator.notify("Found " + searchFor + " on " + strUrl);

                        // Writing the file to the file system with time information
                        FileWriter fw = null;

                        try
                        {
                            String dir = "C:/Documents and Settings/leniel-macaferi/Desktop/out/" + strUrl.replaceAll("[^A-Za-z0-9]", "_") + "/";

                            File file = new File(dir);
                            file.mkdirs();
                            file = new File(dir + timeStamp.replaceAll("[^A-Za-z0-9]", "_") + ".html");
                            file.createNewFile();

                            fw = new FileWriter(file);
                            fw.append(document);
                        }
                        finally
                        {
                            if(fw != null)
                            {
                                fw.flush();
                                fw.close();
                            }
                        }
                    }

                    // If error we reduce the time interval
                    if(error)
                    {
                        numMinutes = 0.5f;
                    }
                    else
                    {
                        numMinutes = 1;
                    }

                    try
                    {
                        Thread.sleep((long) (1000 * 60 * numMinutes));
                    }
                    catch(InterruptedException e)
                    {
                        e.printStackTrace();
                    }

                    // A counter to show us the number of attempts so far
                    numAttempts++;

                    System.out.println("Attempt: " + numAttempts + " on " + strUrl + " at " + calendar.getTime().toString());
                }
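                catch(IOException e)
                {
                    e.printStackTrace();

                    // Waits before retrying so that a failing connection does not
                    // turn this loop into a busy loop
                    try
                    {
                        Thread.sleep((long) (1000 * 60 * numMinutes));
                    }
                    catch(InterruptedException ie)
                    {
                        ie.printStackTrace();
                    }
                }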
            }
        }
        catch(MalformedURLException m)
        {
            m.printStackTrace();
        }
    }
}
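The class above repeats its check with a while(true) loop and Thread.sleep. As an alternative, and not what the original code does, the same periodic behavior can be sketched with a ScheduledExecutorService from java.util.concurrent. The class PollingSketch and its parameters are hypothetical. The sketch uses a single fixed delay, whereas the crawler switches between 30 seconds and 1 minute.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical alternative to the while(true) + Thread.sleep loop.
final class PollingSketch
{
    // Runs the check repeatedly, waiting delayMillis after each run finishes.
    // Returns true if the check ran at least `times` times before the timeout.
    static boolean pollTimes(final Runnable check, int times, long delayMillis, long timeoutMillis)
    {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        final CountDownLatch remaining = new CountDownLatch(times);

        try
        {
            scheduler.scheduleWithFixedDelay(() -> {
                try
                {
                    check.run();
                }
                catch (RuntimeException e)
                {
                    // An uncaught exception would cancel all future runs
                    e.printStackTrace();
                }
                finally
                {
                    remaining.countDown();
                }
            }, 0, delayMillis, TimeUnit.MILLISECONDS);

            return remaining.await(timeoutMillis, TimeUnit.MILLISECONDS);
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();

            return false;
        }
        finally
        {
            // Stops the schedule; the crawler itself would keep it running
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args)
    {
        boolean finished = pollTimes(() -> System.out.println("Checking page..."), 3, 100L, 5000L);

        System.out.println("Finished: " + finished);
    }
}
```

With a scheduler the delay applies after failed checks too, so a page that cannot be reached is not retried in a tight loop.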

This is the Notificator class:

/**
 * @author Rodrigo Gama
 * @year 2009
 */

import java.util.ArrayList;
import java.util.List;
import javax.mail.MessagingException;
import javax.mail.internet.AddressException;

public class Notificator
{
    private List<String>    to   = new ArrayList<String>();
    private String          from = "leniel-macaferi";

    public void addDesetination(String dest)
    {
        to.add(dest);
    }

    public synchronized void notify(String message)
    {
        try
        {
            // Sends the e-mail
            MailSender.sendMail(from, to.toArray(new String[] {}), message);
        }
        catch (AddressException e)
        {
            e.printStackTrace();
        }
        catch (MessagingException e)
        {
            e.printStackTrace();
        }
    }
}

This is the MailSender class:

/**
 * @authors Rodrigo Gama
 * @year 2009
 */

import java.util.Properties;
import javax.mail.Authenticator;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.PasswordAuthentication;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.AddressException;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class MailSender
{
    public static void sendMail(String from, String[] toArray,
            String messageText) throws AddressException, MessagingException
    {
        // Get system properties
        Properties props = System.getProperties();

        // Setup mail server (here we’re using Gmail) 
        props.put("mail.smtp.host", "smtp.gmail.com");
        props.put("mail.smtp.starttls.enable", "true");
        props.put("mail.smtp.auth", "true");

        // Get session
        Authenticator auth = new MyAuthenticator();
        Session session = Session.getDefaultInstance(props, auth);

        // Define message
        MimeMessage message = new MimeMessage(session);
        message.setFrom(new InternetAddress(from));

        for(int i = 0; i < toArray.length; i++)
        {
            String to = toArray[i];

            message.addRecipient(Message.RecipientType.TO, new InternetAddress(to));
        }

        message.setSubject("The e-mail subject goes here!");
        message.setText(messageText);

        // Send message
        Transport.send(message);
    }
}

class MyAuthenticator extends Authenticator
{
    MyAuthenticator()
    {
        super();
    }

    protected PasswordAuthentication getPasswordAuthentication()
    {
        // Your e-mail username and password
        return new PasswordAuthentication("username", "password");
    }
}

How to use it?
Using Eclipse, you just have to create a run configuration that passes the search string and the URLs as program arguments, as shown in this picture:

Java Crawler Run Configuration in Eclipse

I hope you make good use of it!

Source Code
Here it is for your delight: http://leniel.googlepages.com/JavaCrawler.zip

Doing maintenance on Chemtech's site

During the second half of July and the first half of August I worked on Chemtech's site doing some maintenance. It was a good job because I got to know new technology such as Liferay. Liferay is a great enterprise portal that allows you to create a complete website solution coded in Java.

I could also verify the quality of the new Eclipse 3.5 IDE, codenamed Galileo, which I used during the maintenance. It's a great IDE for Java developers. It has lots of plugins that allow you to work with practically any kind of programming technology inside a fantastic set of windows for every type of task. Before using Eclipse I had only worked with NetBeans to do Java development.

I improved my Tomcat skills too.

Chemtech's site also known as chemsite

During a time like this you accelerate the learning process and get to know new things, which is very important for any software developer.

After fixing some bugs on the site and documenting everything I learned and did in chemsite’s project wiki, I came back to Volta Redonda for a two-week job on CSN's MES; more on this in the next post.

You see, for the past 10 months I’ve worked with ASP.NET and Oracle (Braskem), and then I switched to Java and MySQL for 1 month (chemsite). Now at CSN I'm working with Visual Basic and SQL Server.

This shows that in today's world there's no silver-bullet technology when it comes to programming languages and database systems. Each company has its own legacy systems that date back one or two decades, and such systems require certain types of interfaces developed in certain technologies. For example, if you look back 10 years (1999), C# had not even been released, so Visual Basic predominated at that time. Most of the time it is impossible for a company to redevelop a really big system in whatever programming language is today's favorite. Those big systems consumed a lot of time and money to build, and in the future today's favorite language may no longer stand out.

As a software professional you must be able to work with any tool that is put in your hands.

I like what I do, and no matter the tool I use I'm always satisfied with my job because I do what I like to do, that is, software development.

I think the text above says it all about why Chemtech is a great place to work in Brazil. In just 1 year I had the opportunity to work on different projects that use different technologies. It's a great way to leverage a career.

Thanks Jesus for that! :)