Data Mining \ Web Scraping on a Big Data Platform
What is Web Scraping?
The web has lots and lots of data, and sometimes we need to research some specific set of it, but we can't assume all of that data makes sense.
<div class="island summary"> <ul class="iconed-list"> <li class="biz-hours iconed-list-item"> <div class="iconed-list-avatar"> <i class="i ig-biz_details i-clock-open-biz_details"></i> </div> <div class="iconed-list-story"> <span> Today <span class="hour-range"><span class="nowrap">10:00 am</span> - <span class="nowrap">10:00 pm</span></span> </span> <span class="nowrap extra open">Open now</span> </div> </li> <li class="iconed-list-item claim-business"> <div class="iconed-list-avatar"> <i class="i ig-biz_details i-suitcase-red-star-biz_details"></i> </div> <div class="iconed-list-story"> <a href="https://biz.yelp.com/signup/Z_oAg2AmqZEtu5hlfPruNA/account"> <b>Work here?</b> Claim this business </a> </div> </li> </ul> </div>
Above is just a blob of markup with loads of things in it; if you handed me this, I'd ask, what on earth is this?
But if you concentrate on the data, you can derive a few important pieces of info: there is a time written as 10:00 am and some "Open now" text, so we can assume some object (a business, in this case) opens at 10:00 am.
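To make that concrete, here is a minimal sketch of pulling those bits out of the snippet with plain Java regex. The class name and the shortened HTML string are mine, just for illustration; a real scraper would work on the full fetched page:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HoursExtractor {
    // Pull out every time token like "10:00 am" from a blob of markup
    static List<String> extractTimes(String html) {
        List<String> times = new ArrayList<>();
        Matcher m = Pattern.compile("\\d{1,2}:\\d{2}\\s*[ap]m").matcher(html);
        while (m.find()) {
            times.add(m.group());
        }
        return times;
    }

    public static void main(String[] args) {
        // Trimmed-down version of the snippet above
        String html = "<span class=\"hour-range\"><span class=\"nowrap\">10:00 am</span> - "
                    + "<span class=\"nowrap\">10:00 pm</span></span> "
                    + "<span class=\"nowrap extra open\">Open now</span>";
        System.out.println(extractTimes(html));      // prints [10:00 am, 10:00 pm]
        System.out.println(html.contains("Open now")); // prints true
    }
}
```

For one page a regex like this is fine; for messier markup you'd normally reach for a proper HTML parser instead.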
Well, so you read a URL and may get lots of data, useful or useless, but there may be some data in there that does make sense, and extracting that data is what web scraping is.
Now data mining is digging, and digging, and digging through that data to extract information, which is sometimes like solving a puzzle.
Since I use Java, I tried Apache Nutch, which is a scalable web crawler: just extract data with Nutch and dump it into Apache Solr for fast indexing.
Apache Nutch - http://nutch.apache.org/
Apache Solr - http://lucene.apache.org/solr/
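As a rough sketch, a Nutch 1.x crawl that indexes into Solr looks something like the commands below. The seed URL, directory names, and the Solr core name `nutch` are my assumptions; adjust them to your setup:

```shell
# Seed URLs to crawl, one per line (example.com is a placeholder)
mkdir -p urls
echo "http://example.com/" > urls/seed.txt

# Run 2 crawl rounds and index the results into a local Solr core named "nutch"
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch urls/ crawldir/ 2
```

This is a command sketch of the Nutch-to-Solr pipeline, not a runnable script on its own; it assumes Nutch and Solr are already installed and configured.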
Both do the job quite well, but still: why Big Data?
Why Big Data in Web Scraping
Now consider a scenario where I need to match the GPS coordinates of a particular destination across all the YellowPages-style websites out there.
Using a simple DFS crawl, we can assume we'll collect some 10,000+ pages that might hold the record.
Now assume that data is extremely untidy and we are trying to find a GPS coordinate in it.
Logically, since we need to extract a pattern from all these pages to get the GPS coordinates (and then match them), we may need to run some very complex regexes, and doing that sequentially on one machine will surely drain its memory and take forever.
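As one possible shape of that pattern, here is a sketch that matches decimal latitude/longitude pairs such as "40.7128, -74.0060". The exact format varies wildly per site, so this regex, the class name, and the sample page are all assumptions for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GpsMatcher {
    // Decimal "lat, long" pairs like "40.7128, -74.0060".
    // Real pages use many other formats; this pattern is an assumption.
    static final Pattern GPS =
        Pattern.compile("(-?\\d{1,3}\\.\\d+)\\s*,\\s*(-?\\d{1,3}\\.\\d+)");

    // Return the first coordinate found, or null if the page has none
    static String findCoordinate(String page) {
        Matcher m = GPS.matcher(page);
        return m.find() ? m.group(1) + "," + m.group(2) : null;
    }

    public static void main(String[] args) {
        String page = "<div data-latlng=\"40.7128, -74.0060\">New York</div>";
        System.out.println(findCoordinate(page)); // prints 40.7128,-74.0060
    }
}
```

Running this once is cheap; running it over 10,000+ untidy pages, with several such patterns, is where a single machine starts to hurt.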
So now, just thinking logically, what if we run the entire process in parallel...
Bingo! Big Data comes into the picture.
I will first write a script to pull all the data from all the URLs and dump it into HDFS, and then run the regex as a Spark job over that data to get the result quickly.
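The core of that Spark job is just a parallel filter over pages. Below is a local, single-machine analogue of the idea using Java parallel streams; on a real cluster this would be Spark's `textFile(...).filter(...)` over the HDFS dump instead, and the class name and sample pages here are mine:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ParallelScan {
    // Same assumed "lat, long" pattern as before
    static final Pattern GPS =
        Pattern.compile("-?\\d{1,3}\\.\\d+\\s*,\\s*-?\\d{1,3}\\.\\d+");

    // Scan every page in parallel and keep the ones containing a coordinate.
    // parallelStream() splits the work across local CPU cores, the way
    // Spark would split it across cluster nodes.
    static List<String> pagesWithGps(List<String> pages) {
        return pages.parallelStream()
                    .filter(p -> GPS.matcher(p).find())
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> pages = List.of(
            "Joe's Diner, 40.7128, -74.0060",
            "no coordinates on this page",
            "HQ at 51.5074, -0.1278");
        System.out.println(pagesWithGps(pages).size()); // prints 2
    }
}
```

The win of Spark over this local sketch is that the pages never have to fit on one machine: the filter runs where the data lives.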
So, to conclude: if we have an incremental process and the scraping depth is high, then it's really helpful to use a Big Data setup for the job.