
Monday, November 7, 2011

[JAVA] Thread Crawler & Bulk Image Downloader


My latest project is a thread crawler and bulk image downloader written in Java. It was easier than expected, with only a few bumps along the way that were solved pretty quickly.

Here is how the process works:

When you first start the application, it asks you to paste in the URLs to crawl (it currently accepts only one at a time, but future versions will allow bulk copy and paste).

So, let's crawl a thread. For example, I will use this thread:

From here, the program analyzes the links on this page. It saves all same-domain links and crawls those for images (this is handy because most picture sites have thumbnails that lead to full-size images). Those "children" URLs are crawled after the main URL, and any image over a given size (I'm using 15KB) is then saved via a byte stream to the project folder. The byte stream is not the fastest approach, but it works OK and gives you time to hunt for other picture sites while it works.
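Since I haven't posted the source, here is a minimal sketch of the two ideas in that step: keeping only same-domain links, and saving an image through a byte stream only if it is over the size cutoff. This is not the project's actual code. The class and method names (CrawlSketch, sameDomainLinks, saveIfLargeEnough) are made up for illustration, the regex-based href matching is a simplification that will miss oddly formatted HTML, and buffering the whole image in memory before checking its size is an assumption about how the check could be done.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative, not from the project.
final class CrawlSketch {
    // The 15KB cutoff mentioned in the post.
    static final int MIN_IMAGE_BYTES = 15 * 1024;

    // Simplified href matcher (assumption: quoted attribute values only).
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the absolute links in the page that share the page's host.
    static List<String> sameDomainLinks(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            URI link;
            try {
                // Resolve relative links against the page URL.
                link = base.resolve(m.group(1).trim());
            } catch (IllegalArgumentException e) {
                continue; // skip hrefs that are not valid URIs
            }
            String host = link.getHost();
            if (host != null && host.equalsIgnoreCase(base.getHost())
                    && !links.contains(link.toString())) {
                links.add(link.toString());
            }
        }
        return links;
    }

    // Reads the whole stream into memory in 8KB chunks.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Writes the image to 'out' only if it is larger than the cutoff.
    // Returns true if the image was saved.
    static boolean saveIfLargeEnough(InputStream in, OutputStream out) throws IOException {
        byte[] data = readAll(in);
        if (data.length <= MIN_IMAGE_BYTES) {
            return false; // probably a thumbnail or an icon
        }
        out.write(data);
        return true;
    }
}
```

In a real run, the InputStream would come from something like URL.openStream() and the OutputStream would be a FileOutputStream in the project folder. Keeping them as plain streams here makes the size check easy to try without touching the network.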

From here, the process iterates over the list of user-submitted URLs and those URLs' children, scouring them all for pictures.
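The order of that iteration can be sketched roughly like this. Again, this is an illustration rather than the real code: CrawlOrderSketch and crawlOrder are hypothetical names, the childrenOf function stands in for whatever fetches a page and pulls out its same-domain links, and I am assuming only one level of children is followed and that a URL seen twice is crawled once.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper class; names are illustrative, not from the project.
final class CrawlOrderSketch {
    // Returns the URLs in the order they would be crawled: each
    // user-submitted URL first, then its children, with duplicates skipped.
    static List<String> crawlOrder(List<String> userUrls,
                                   Function<String, List<String>> childrenOf) {
        // LinkedHashSet keeps insertion order and ignores repeats.
        Set<String> order = new LinkedHashSet<>();
        for (String parent : new LinkedHashSet<>(userUrls)) {
            order.add(parent);
            // Only one level deep: children of children are not followed.
            for (String child : childrenOf.apply(parent)) {
                order.add(child);
            }
        }
        return new ArrayList<>(order);
    }
}
```

Passing the link lookup in as a function keeps the ordering logic separate from the networking, so it can be tried with a hard-coded map of pages before pointing it at a real site.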

Download log:

And now to see the final product...


Final notes:

I've noticed some imperfections in how the code handles certain websites. Some sites have been harder to parse than others, and popular ones such as Google Images don't work at all. Future versions will contain special cases for those popular websites.

Business Aspect:

While this was a relatively easy project, there is room for marketing it as an independent application (although, in such an early beta, it would not go live for quite some time). It would require some reworking and polishing, as well as developing an attractive, easy-to-use graphical user interface.

I hacked together a simple GUI for it:

However, this was done in NetBeans while my project is in Eclipse. I don't have the patience to port either one to the other, so while I'm just using this for personal use, I'll stick to Eclipse's System.in.

2 comments:

Nice idea, but your app doesn't seem to work. I can't download any pics.

I didn't post the source code... The code snippet I posted is just a pic of part of the process.

I may be sharing this later. It depends on where I want to go with it. I am currently working on multi-threaded fully-automated crawlers.
