Tuesday, May 12, 2015

Trouble Connecting Apache Spark with Hbase due to missing classes

When you try to connect HBase with Apache Spark, in most cases it throws exceptions such as ImmutableBytesWritableToStringConverter not found, Google Guava classes not found, and various other errors at runtime.

Almost all of them belong to the same family of problems: missing jars on the classpath.


To solve it straightforwardly, just go to spark-defaults.conf and update spark.driver.extraClassPath with the required libraries, adding more as new classes turn up missing.

For example, for the missing ImmutableBytesWritableToStringConverter, add spark-examples-1.3.1-hadoop2.4.0.jar.


spark.driver.extraClassPath /Users/abhishekchoudhary/anaconda/anaconda/lib/python2.7/site-packages/graphlab/graphlab-create-spark-integration.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-server-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-protocol-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-hadoop2-compat-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-client-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-common-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/htrace-core-2.04.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/lib/spark-examples-1.3.1-hadoop2.4.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/lib/spark-assembly-1.3.1-hadoop2.4.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/guava-12.0.1.jar
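With the classpath in place, reading an HBase table from PySpark is a matter of pointing newAPIHadoopRDD at TableInputFormat and the Python converters that ship with the Spark examples jar. A minimal sketch, assuming a local ZooKeeper quorum and a placeholder table name (adjust both for your cluster):

 from pyspark import SparkContext

 sc = SparkContext(appName="HBaseRead")
 conf = {"hbase.zookeeper.quorum": "localhost",           # assumption: local quorum
         "hbase.mapreduce.inputtable": "bigdatatwitter"}  # placeholder table name
 rdd = sc.newAPIHadoopRDD(
     "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
     "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
     "org.apache.hadoop.hbase.client.Result",
     keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
     valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
     conf=conf)
 print(rdd.take(5))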




And one more thing: accessing HBase through Spark is actually very fast, so near-real-time updates are practical.


Sunday, May 10, 2015

HBase ignore comma while using bulk loading importTSV

HBase simply drops part of your text while importing a CSV file, and the worst part is that it doesn't even tell you.
The entire job passes, but your HBase table ends up with no data or only partial data. For example, if a column holds a value such as

"this text can be uploaded, but it has more", then only the part before the first comma ends up in the HBase table cell; the rest of the contents are gone.
This happens because I was running importTSV with the separator set to comma (,), which made the import engine treat every comma inside a CSV cell as a column delimiter.



It took 32 YARN jobs to figure out the actual issue.

Import CSV command -


create 'bigdatatwitter','main','detail','other'


hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,other:contributors,main:truncated,main:text,main:in_reply_to_status_id,main:id,main:favorite_count,main:source,detail:retweeted,detail:coordinates,detail:entities,detail:in_reply_to_screen_name,detail:in_reply_to_user_id,detail:retweet_count,detail:id_str,detail:favorited,detail:retweeted_status,other:user,other:geo,other:in_reply_to_user_id_str,other:possibly_sensitive,other:lang,detail:created_at,other:in_reply_to_status_id_str,detail:place,detail:metadata" -Dimporttsv.separator="," bigdatatwitter file:///Users/abhishekchoudhary/PycharmProjects/DeepLearning/AllMix/bigdata3.csv
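One workaround I would suggest is to rewrite the file with a separator that never occurs inside the cells (a tab, for instance) before calling importTSV, letting Python's csv module handle the quoted commas. A minimal sketch, with hypothetical file names; afterwards -Dimporttsv.separator can simply be left at its tab default:

 import csv

 # rewrite the comma-separated file as tab-separated; csv.reader keeps
 # quoted fields containing commas intact
 with open("bigdata3.csv") as src, open("bigdata3.tsv", "w") as dst:
     reader = csv.reader(src)
     writer = csv.writer(dst, delimiter="\t")
     for row in reader:
         writer.writerow(row)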

Monday, March 30, 2015

Internal of Hadoop Mapper Input to customize



Internal of Hadoop Mapper Input 

Well, I just got a requirement to somehow change the input split size fed to the Mapper, and simply changing the configuration didn't help me much, so I dug further to understand exactly what's inside -



The diagram above shows the job flow inside the Mapper, and the five methods in it all have something to do with the split size.

Following is the way an input is processed inside the Mapper -

- The input file is split into InputSplits by the InputFormat class.

- Key-value pairs are generated from each InputSplit using a RecordReader.

- All key-value pairs generated from the same split are sent to the same Mapper, so one mapper handles all the key-value pairs of a specific split.

- The map method is called once for each key-value pair, and its output is sent to the partitioner.

- The Partitioner collects the results from each mapper and decides which Reducer gets which keys.

- The Reducer then processes the partitioned output further.


So now I had found the class, InputFormat, where I could introduce my change, based on my requirement.



But further checking the exact class helped me more -

 @Deprecated  
 public interface InputSplit extends Writable {  
  /**  
   * Get the total number of bytes in the data of the <code>InputSplit</code>.  
   *   
   * @return the number of bytes in the input split.  
   * @throws IOException  
   */  
  long getLength() throws IOException;  
  /**  
   * Get the list of hostnames where the input split is located.  
   *   
   * @return list of hostnames where data of the <code>InputSplit</code> is  
   *     located as an array of <code>String</code>s.  
   * @throws IOException  
   */  
  String[] getLocations() throws IOException;  
 }  


Further, there are a few more classes to check, like TextInputFormat, SequenceFileInputFormat and others.

Hold on... we have the RecordReader in between, which turns the input into key-value pairs, and what if I need to do something with it -

RecordReader.java Interface

We can find implementations of RecordReader in LineRecordReader or SequenceFileRecordReader.



There we can see that an input split sometimes crosses a block boundary, and that situation is handled explicitly, so a custom RecordReader must address it as well.
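For reference, the split size can also be nudged from the client side with the standard Hadoop 2 properties, which is the configuration-only route mentioned at the start; whether that is enough depends on the requirement. A minimal sketch in PySpark, with a placeholder path and placeholder sizes:

 from pyspark import SparkContext

 sc = SparkContext(appName="SplitSizeDemo")
 rdd = sc.newAPIHadoopFile(
     "hdfs:///data/input",                                   # placeholder path
     "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
     "org.apache.hadoop.io.LongWritable",
     "org.apache.hadoop.io.Text",
     conf={"mapreduce.input.fileinputformat.split.maxsize": str(64 * 1024 * 1024),
           "mapreduce.input.fileinputformat.split.minsize": str(32 * 1024 * 1024)})
 print(rdd.getNumPartitions())   # one partition per input split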


Tuesday, February 17, 2015

Things to know before Big Data & Machine Learning

When I started with Big Data, I started with Hadoop.
When I started with Machine Learning, I started with linear regression.

But over time I realized I hadn't made the best choice.
Why?


Because I missed the core of the technologies. I managed to finish the job anyway, but that doesn't mean I did it right. I fell into the gap between learning a technology and actually knowing it; I missed the fundamentals behind the specifics.





So here is what I personally recommend covering before Big Data and Machine Learning -

Don't get confused: Big Data and Machine Learning are two different things, but they need each other. And you will know it once you do it :-)




Things to Do before Map-Reduce -


  • First understand the Map-Reduce DATA STRUCTURE.
  • Write your own implementation of Map-Reduce, which is ultra easy without using any framework like Hadoop or Spark (see the sketch after this list).
  • Refresh graph algorithms like DFS & BFS.
  • Explore some basic Dynamic Programming and Greedy algorithms like Knapsack, LCS, Floyd-Warshall, KMP etc.
  • SQL or RDBMS, or more precisely the data model and relational algebra.
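A minimal sketch of that do-it-yourself exercise: word count written in the map / shuffle / reduce style with nothing but the standard library, just to get a feel for the key-value data structure the frameworks are built around.

 from collections import defaultdict

 def map_phase(documents):
     # emit (word, 1) pairs, exactly what a Mapper would do
     for doc in documents:
         for word in doc.split():
             yield (word, 1)

 def shuffle_phase(pairs):
     # group all values by key; the framework normally does this for you
     grouped = defaultdict(list)
     for key, value in pairs:
         grouped[key].append(value)
     return grouped

 def reduce_phase(grouped):
     # aggregate each key's values, the Reducer step
     return {key: sum(values) for key, values in grouped.items()}

 docs = ["big data needs map reduce", "machine learning needs data"]
 print(reduce_phase(shuffle_phase(map_phase(docs))))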



Things to do before Machine Learning -

  • Mathematics 
  • Vector , Scalar
  • Matrix Multiplication , Addition and other basic operations
  • Linear Formulation
  • Probability , conditional and independent
  • Probability Distribution
  • Basics of Permutation and Combination
  • Hypothesis
  • Very basic statistics like mean, median, standard deviation and variance
  • Regression




Ok, seems like a lot to do before you even start off... Well, practically it's not.
Everything is from high school or college, so ideally you just need to refresh your memory, and doing so will actually make starting out more exciting.


Warning : This list is going to grow further :-)




Monday, December 29, 2014

Why did I use BigData Technology (Spark) for Machine Learning (NLP)

BIG DATA WITH NLP 

I am here to do sentiment analysis of a Twitter dataset, and I am trying to make it a generalized platform that works irrespective of Twitter or any other source, so the dimension and size of the training data are large and will keep growing.

As a machine learning developer, I thought of trying Python as the technology, and for the trial I had around 1,600,000 records from http://help.sentiment140.com/




Why not R?


Well, first, I found that R is not the same traditional style of coding, so comfort was a question, although making things in R is extremely easy. So that alone can't be the reason to try Python.


Python is a nice language, the NLTK library makes NLP very easy, and there are loads of examples out there; NLP is very powerful with NLTK, along with the scikit-learn library for machine learning algorithms.


In my view it is simply easier to code in Python compared to R.

Python also has brilliant support for Hadoop Streaming, so I will vote for Python again.

So here I got an opportunity to try something new and thats always exciting :-)

(I am not saying Python is better than R , as I use both of them , in this particular scenario , for me , python stood)

So how to do that ?


Well, sentiment analysis is not just one step where you get the text, run an algorithm, and here we go. It is never like that; once you get the data you need to decide its polarity, which means you first need to

find whether a statement is positive or negative,
then create a feature set,
then find stop words,
then a few more steps...
then finally train it,
then test the accuracy,
then finally predict with it.

I mean, there are loads of steps before reaching the goal, so it is in no way a trivial task, but once you are clear with the concept it is not tough either.
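A minimal sketch of that pipeline with NLTK, on toy data (the real run in this post uses the sentiment140 records and a far larger feature set; the stopwords corpus must already be downloaded):

 from nltk.corpus import stopwords
 from nltk.classify import NaiveBayesClassifier

 stop = set(stopwords.words("english"))   # requires nltk.download('stopwords')

 def features(text):
     # bag-of-words feature set with stop words removed
     return {word: True for word in text.lower().split() if word not in stop}

 train = [(features("i love this phone"), "pos"),
          (features("worst service ever"), "neg")]

 classifier = NaiveBayesClassifier.train(train)
 print(classifier.classify(features("love the service")))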

check Text Processing


Whats more ?


So I first tried with a small dataset of around 400 records, including 100 test records, and ran NaiveBayes from NLTK



 from nltk.classify import NaiveBayesClassifier  # X_train is a list of (feature_dict, label) pairs
 classifier = NaiveBayesClassifier.train(X_train)

and it ran in a very short time, around 10 seconds; then some accuracy and precision tests took another 10-20 seconds, and that was it.
The file size was approx. 500 KB, and the whole thing took 20 to 30 seconds.


Cool, that's brilliant: 30 seconds and all set.
But is it enough? Are we set?


Then why do I need Bigdata framework ?

On exploring more, I found there is a brilliant sentiment dataset available from sentiment140; the file size is about 250 MB, or more precisely 1,600,000 records, which is a pretty decent size for getting a proper sentiment analysis result.

So I was all set and ran the same algorithm on this new dataset. The file contains positive and negative statements collected from Twitter, so I changed the file name and ran the Python script.

Damn!!! My system has quite a good configuration, and yet a memory exception. What... can't I run this? Well, no, I can't. I tried several times, on a high-spec Windows machine as well as a Mac, and got the same thing: MemoryError.

So I was left with only one choice: try the same thing on a Big Data platform, where at least it should run. I ran the same program in a Hadoop YARN environment with Spark, and it finally ran, but at what cost???
It took more than 14 minutes to finish the job.






So 14 minutes for 1,600,000 records, but I need to deal with 20 times this size; how the hell am I going to do that?
Simple mathematics says 20 * 14 minutes, and in practice it would be even worse, because resource utilisation and memory pressure will push the time higher and higher.


So is it really a feasible option?

Now Spark fits here.

Spark has a brilliant library, MLlib, which contains almost all the popular machine learning APIs (Spark Library).
As I already mentioned, I am using NaiveBayes, and Spark already has an API for it.

So I needed to change the code, because Spark works on the concept of the RDD, i.e. the resilient distributed dataset (RDD Concept).

I am surely not going to explain every change, but I must admit they were pretty substantial; almost everything had to be reworked to parallelise the process, which is the basic concept behind Big Data frameworks including Spark.
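Roughly, the MLlib version looks like the following. This is a minimal sketch, not the exact code from the repository: the tab-separated layout, the hashing-based feature extraction and the file name are assumptions made for illustration.

 from pyspark import SparkContext
 from pyspark.mllib.classification import NaiveBayes
 from pyspark.mllib.feature import HashingTF
 from pyspark.mllib.regression import LabeledPoint

 sc = SparkContext(appName="TwitterSentiment")
 tf = HashingTF(numFeatures=2 ** 18)          # hash words into a fixed-size feature vector

 def to_labeled_point(line):
     label, text = line.split("\t", 1)        # assumed "label<TAB>text" layout
     return LabeledPoint(float(label), tf.transform(text.split()))

 data = sc.textFile("sentiment140.tsv").map(to_labeled_point)
 train, test = data.randomSplit([0.8, 0.2], seed=42)

 model = NaiveBayes.train(train)
 correct = test.map(lambda p: (model.predict(p.features), p.label)) \
               .filter(lambda pair: pair[0] == pair[1]).count()
 print(correct / float(test.count()))         # simple accuracy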

Now I was set and ran Spark on my local system on YARN (no cluster), and surprisingly it took around 2.4 minutes; even after I added more extensive operations to the code, the maximum it went up to was 3.8 minutes. Find below -








Conclusion

These days, when data sources are practically unlimited, it is very hard to restrict yourself when there are technologies available to handle the size and dimension of the data, and Spark is an excellent fit for Big Data. In machine learning it is not only about running the algorithm, but even setting that aside, the Big Data platform and technology stack is a brilliant resource.

So, moving on, I am going to use Spark for text processing, but I have one more great option, GraphLab; once I try it I will compare Spark and GraphLab. Initial research suggested GraphLab is a little bit faster than Spark.


All my code is already on GitHub:

Proper Sentiment Analysis in Spark
Python NLTK Naive Bayes Example
Analysis in YARN Long Running



Tuesday, November 25, 2014

Use Python for Recommendation Engine



Python For Recommendation Engine

Well, here (Page1) we got the data and fetched the associated details for each record in the dataset. Alright, now let's roll; we'll just write the algorithm and we're done...

Good morning... this is the real world, and nothing happens so easily here.

Why did I say so ?

   Var1 Freq
    N/A  405

Because I have 405 N/A values in imdbRating. That means I have no rating available for 405 of the 1173 total movies. That's sick, because it makes roughly 35% of the dataset invalid.

So what should I do? Should I simply dump the data? And why did it happen?

I can't dump that much data. It happened because there are movies which have no details in IMDB, maybe because some movies are known only in a particular region rather than globally, or for some other reason.


What Should I Do ?

I will cleanse and shape my data.

Let's consider a movie named KNHN. I will try to find how many FB friends like the movie, let's say 10; now I can compare that number against the most-liked movie in my list, i.e. 3 Idiots with 30 likes.

So now I know that about a third of that highest count liked KNHN as well, so the weightage of this movie is a bit high; naturally it should sit above the average rating, i.e. 5.0. So I added that fraction of the highest rating, 8.4, to 5.0, and I end up with a value somewhere around 7.5 (as a rough example).

So are we finished here? Not quite; there are a lot more fields to address, like imdbUserVotes and rottenUserVotes. So what should I do here?

Simple: I will use linear regression, or more precisely machine learning, to fill them in. Bingo!!!! I used the statsmodels library in Python (via its formula API) to run a regression with polynomial terms -

 import pandas as pd
 import statsmodels.formula.api as sm

 # fit a regression with polynomial genre terms, then predict the missing review counts
 result = sm.ols(
     formula="tomatoUserReviews ~ imdbRating + tomatoRating + I(genre1 ** 1.0) + I(genre2 ** 1.0) + I(genre3 ** 4.0)",
     data=df2).fit()
 pretomatoRating = result.predict(
     pd.DataFrame({'imdbRating': [newRating], 'tomatoRating': [newRating],
                   'genre1': [0], 'genre2': [0], 'genre3': [0]}))
 pretomatoRating = int(round(pretomatoRating))
 if pretomatoRating < 0:
     pretomatoRating = 10000
 self.df2.loc[index, "tomatoUserReviews"] = pretomatoRating


So now my dataset is quite pretty:




imdbRating imdbVotes imdbID tomatoRating tomatoUserReviews
1 6.20 89960 6.20 3504356
2 6.40 5108 tt2678948 6.30 152
3 6.80 35 tt0375174 6.30 125
4 7.20 99763 tt0087538 6.90 314496
5 6.90 231169 tt1068680 5.30 316060
6 7.70 291519 tt0289879 4.80 621210
7 6.90 174047 tt1401152 5.80 74879
8 7.20 1038 tt0452580 6.30 253
9 4.90 10381 6.30 6006
10 7.20 28 tt0297197 6.30 6006
11 4.90 10381 tt1278160 6.30 6006
12 4.90 10381 6.30 6006
13 4.90 10381 6.30 6006
14 3.20 690 tt2186731 6.30 862
15 6.20 5594 tt1252596 6.30 2436

So now all of my movies have Ratings and user reviews (almost).

So now I can run my recommendation.




Shaping The Final Outcome

Now that I have all the required variables in the dataset, I can finally run my distance algorithm to conclude which is the next best movie.

For the ratings, since the data is consistent, I used Euclidean distance; but the votes are very sparse, so for them I used cosine distance.

 # requires: from scipy.spatial import distance (scipy), pandas DataFrame self.df2
 from scipy.spatial import distance

 def calculatedistance(self, movie1, movie2):
     movie1DF = self.df2[self.df2.MOVIE == movie1]
     movie2DF = self.df2[self.df2.MOVIE == movie2]
     rating1 = (float(movie1DF.iloc[0]['imdbRating']), float(movie1DF.iloc[0]['tomatoRating']))
     rating2 = (float(movie2DF.iloc[0]['imdbRating']), float(movie2DF.iloc[0]['tomatoRating']))
     review1 = (int(movie1DF.iloc[0]['imdbVotes']), int(movie1DF.iloc[0]['tomatoUserReviews']))
     review2 = (int(movie2DF.iloc[0]['imdbVotes']), int(movie2DF.iloc[0]['tomatoUserReviews']))
     distances = []
     # ratings are dense, so plain Euclidean distance works well
     distances.append(round(distance.euclidean(rating1, rating2), 2))
     # votes are sparse, so cosine distance is preferred over Euclidean
     # http://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data
     distances.append(round(distance.cosine(review1, review2), 2))
     return distances


Python code has -

User Recommendation - recommendMovie
Distance Comparison between 2 movies - calculatedistance
Distance comparison between multiple movies - compareOneMovieWithMultiple
Distance Comparison Among Movies - findSimilarityBetweenMovies

and few more.

One sample I have uploaded to the cloud - Distance Calculated between Movies

Entire Source Code - git link


A few more analytics in R will be uploaded.



What More Can be done ?

A lot more. For lots of movies I couldn't extract the genre, so I could get that through text extraction from Twitter to improve the result.

Lots of the basic ratings could also be derived from sentiment analysis of the movie name across different crawlers.

The recommendation algorithm can be made more versatile.












Recommendation Engine using Python

What is Recommendation System


Well, if you are reading this you probably already know what recommendation is, but basically it is giving a suggestion to the user based on certain parameters: on Amazon, when you buy books it gives you suggestions; on IMDB, when you check any movie, it gives you recommended movies; and so on.



Now let's think: how is that possible?

Answer - it's science, or I should say mathematics.

Let's play with some normal logic here-

Let's consider the movie 'The Godfather'.

Now, if somebody likes it, can we recommend that user any superhit comedy movie? ... aah, naah... Not really; naturally we would rather recommend the next best movie under the genres Crime, Drama or Thriller.
So that's a recommendation.

Hold on!... As a human I can search and find, but how is that possible using a programming language?

So now we have a point, and I'd like to push you all back into the past, to your school days:
'Euclidean distance or Manhattan distance'...

Do those words seem familiar to you? Well, at first sight it may be like "What the hell is this guy talking about?", or "why should I need mathematics here?"

Because it is mathematics, and only mathematics :-)

The Manhattan distance between two points in a grid is measured along a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal or "as the crow flies" distance. It is the simple sum of the horizontal and vertical components, whereas the diagonal (Euclidean) distance is computed by applying the Pythagorean theorem.
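To make the two distances concrete, here is a tiny illustration in Python; the two points stand for hypothetical (imdbRating, tomatoRating) pairs of two movies.

 import math

 a = (7.2, 6.9)   # hypothetical ratings of movie 1
 b = (6.4, 6.3)   # hypothetical ratings of movie 2

 # straight-line ("as the crow flies") distance
 euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
 # grid-wise distance: sum of horizontal and vertical components
 manhattan = sum(abs(x - y) for x, y in zip(a, b))

 print(euclidean, manhattan)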

For more details, Google them and learn the actual concept behind each. That's important.


What Exactly I did with R and Python

I used R to do quick analytics on my data and used Python to write the main algorithm behind it.
I must say everything I did in Python could have been done in R as well, but I used Python because I am a little more comfortable in a general-purpose programming language than in a scientific one, i.e. R. My personal opinion, so please don't just take my word for it :)


Concept behind the entire approach

If I have a collection of movies, then I can easily extract details about each of them, like user ratings, reviews, genre and so on; and if that is possible, then I can use any distance algorithm, like Euclidean distance, to find the similarity between any two movies.

Now I will pick the movies in each of my friends' lists from Facebook, compare them against the list of unwatched movies, and sort the unwatched movies by distance, so that I end up with a list of recommendations ordered from best to worst.
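The ranking idea itself is tiny; a minimal sketch with hypothetical names, where dist is any pairwise distance function such as the calculatedistance method shown in the other part of this post:

 def recommend(liked_movie, unwatched, dist):
     # score every unwatched movie by its distance from a movie the friend liked
     scored = [(dist(liked_movie, candidate), candidate) for candidate in unwatched]
     # smallest distance first = most similar first
     return [title for _, title in sorted(scored)]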


Is it actually so easy ????


Explanation based on Programming

First I connected to Facebook using OAuth and its REST Graph API so that I could get access to my friends, and then to the movies liked by each friend.


 import facebook  # the facebook-sdk package; access_token obtained via OAuth

 graph = facebook.GraphAPI(access_token)
 profile = graph.get_object("me")
 friends = graph.get_connections("me", 'friends')['data']

Then I used the same API to get the movies liked by each friend:



 allLikes = graph.get_connections(friend['id'], "movies")['data']  


Now the next job was to get the details about each of these movies, like ratings and all the other fields, and finally I ended up with the following contents -




MOVIE Year Released Genre Director Poster imdbRating imdbVotes imdbID tomatoRating tomatoUserReviews BoxOffice
2 Jilla 2014 10/01/14 Action, Drama, Thriller R.T. Neason http://ia.media-imdb.com/images/M/MV5BOTUxNzExOTA0NF5BMl5BanBnXkFtZTgwMTUzNTAxMjE@._V1_SX300.jpg 6.4 5108 tt2678948 N/A 152 N/A
3 Pursuit of Happyness 2005 16/07/05 Documentary Patrick McGuinn http://ia.media-imdb.com/images/M/MV5BMTk4NjQ2NzI5Nl5BMl5BanBnXkFtZTcwOTIzNTM0MQ@@._V1_SX300.jpg 6.8 35 tt0375174 N/A 125 N/A
4 The Karate Kid 1984 22/06/84 Action, Drama, Family John G. Avildsen http://ia.media-imdb.com/images/M/MV5BMTkyNjE3MjM2MV5BMl5BanBnXkFtZTYwMzY5ODk4._V1_SX300.jpg 7.2 99763 tt0087538 6.9 314496 N/A
5 Yes Man 2008 19/12/08 Comedy, Romance Peyton Reed http://ia.media-imdb.com/images/M/MV5BNjYyOTkyMzg2OV5BMl5BanBnXkFtZTcwODAxNjk3MQ@@._V1_SX300.jpg 6.9 231169 tt1068680 5.3 316060 $97.6M
6 The Butterfly Effect 2004 23/01/04 Sci-Fi, Thriller Eric Bress, J. Mackye Gruber http://ia.media-imdb.com/images/M/MV5BMTI1ODkxNzg2N15BMl5BanBnXkFtZTYwMzQ2MTg2._V1_SX300.jpg 7.7 291519 tt0289879 4.8 621210 $57.7M
7 Unknown 2011 18/02/11 Action, Mystery, Thriller Jaume Collet-Serra http://ia.media-imdb.com/images/M/MV5BODA4NTk3MTQwN15BMl5BanBnXkFtZTcwNjUwMTMxNA@@._V1_SX300.jpg 6.9 174047 tt1401152 5.8 74879 $63.7M
8 A Year Ago in Winter 2008 06/01/10 Drama Caroline Link http://ia.media-imdb.com/images/M/MV5BMTQ4MTUzNTIwM15BMl5BanBnXkFtZTcwMTEzMjA0Mg@@._V1_SX300.jpg 7.2 1038 tt0452580 N/A 253 N/A
10 James Bond 007 1983 N/A Adventure, Animation, Action N/A N/A 7.2 28 tt0297197 N/A N/A N/A
14 Department 2012 18/05/12 Action Ram Gopal Varma N/A 3.2 690 tt2186731 N/A 862 N/A
15 Ajab Prem Ki Ghazab Kahani 2009 06/11/09 Comedy, Romance Rajkumar Santoshi http://ia.media-imdb.com/images/M/MV5BMjA0NjAwNzYxOV5BMl5BanBnXkFtZTcwNzA4NTk5Mw@@._V1_SX300.jpg 6.2 5594 tt1252596 N/A 2436 N/A



Entire code to get the data can be found facebook_movies_dataset
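The post does not say which metadata service backs those columns; as a purely hypothetical illustration, a lookup against the public OMDb API (which returns imdbRating, imdbVotes, tomatoRating and similar fields) could look like the sketch below. Newer OMDb deployments also require an apikey parameter.

 import requests

 def movie_details(title):
     # hypothetical lookup; add "apikey": "..." to params if the endpoint requires a key
     resp = requests.get("http://www.omdbapi.com/",
                         params={"t": title, "tomatoes": "true"})
     return resp.json()

 print(movie_details("The Karate Kid").get("imdbRating"))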

Now that I have the data, why not run some quick analytics on it and see whether I have really done a good job -




So from the graph above, we can easily see that 3 Idiots is the most liked movie among my FB movies list, and Gaurav Shr.. has liked the largest number of movies, so naturally he doesn't have any work :)


Source Movies_Analytics


So now I have a dataset which gives me some values, and I can work on my recommendation engine... if you have caught your breath, let's move on to the next important thing: what to code and how to shape the data...

'This is my first-ever program in Python, so I can't claim it is great code.'

Part2





Wednesday, November 5, 2014

Quick analysis on new Mobile available in Market and Sentiments using Twitter api


Phone Analysis and Sentiment

I have heard a lot about the iPhone 6 and now the Nexus 6, so I finally decided to try my first-ever Twitter analytics test using R, with the objective of finding out how people are talking about these phones.

Well, to honor Samsung, I added it too.

A few things to know before starting Twitter analytics using R (not from an expert's point of view) -

1. It's very easy, and R will make your life ultra easy. Python is great for this as well, but I will post about that some other time.

2. For quick analytics there is no point chasing Big Data; simply use fewer records and get the job done.

3. The number of records does change the fate of your analytics, so it is always good to rerun the same analysis on a large data volume; do that step once your Big Data setup is done and you already have experience working with Cassandra, Spark, etc.

4. To learn this you just need to know some technology, and you must love technology; then learning R and doing #rstats won't be trouble.

5. Some basic libraries like 'twitteR', 'plyr', 'stringr' and 'ggplot2' cover Twitter access and statistics; do some basic hands-on with them and then roll.

6. If you are good at mathematics, that's wonderful, but you need not be the best at it; basic school mathematics concepts are enough for a starter... at least that's what I felt.

Once I ran the stats-



Great, the iPhone still leads on the positive side, while the Nexus 6, based on this data, is actually not getting that positive a vibe; well, then why did I hear otherwise...

Again, all these stats can change a lot if I try the same thing with more data.


The source code with comments is already uploaded to my GitHub (RsourceFile); just explore.

Happy Coding..





Monday, October 27, 2014

Some interesting things about Regression

Residuals


Once we run a linear regression model, we get residuals in the summary. To understand residuals better, here are some interesting points I stumbled upon that helped me, so I am drafting them...


  • Residuals have mean zero, which means the residuals are balanced across the data points: no pattern, just scatter, with roughly equal positive and negative values.


So if I run the linear regression in R -

fit <- lm(relation ~ person, data = people)

then to verify the claim, just take the simple mean of the residuals -

mean(fit$residuals) - which must give a value very close to zero.





  • There is no correlation between the residuals and the predictors.

cov(fit$residuals, people$person)

gives the covariance, which should also be very close to zero.


While googling I found a new equation -


  • var(data) = var(estimate) + var(residuals)
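The post checks these properties in R; here are the same three sanity checks in Python on synthetic data, purely for reference (a minimal sketch using numpy and statsmodels):

 import numpy as np
 import statsmodels.api as sm

 x = np.random.uniform(-3, 3, 100)
 y = 2 * x + np.random.normal(scale=0.5, size=100)

 fit = sm.OLS(y, sm.add_constant(x)).fit()
 resid = fit.resid

 print(np.mean(resid))                                        # ~ 0
 print(np.cov(resid, x)[0, 1])                                # ~ 0, no correlation with the predictor
 print(np.var(y), np.var(fit.fittedvalues) + np.var(resid))   # var(data) = var(estimate) + var(residuals)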


Least Square

The regression line is the line through the data that has the minimum (least) squared 'error', where the error is the vertical distance between an actual data point and the prediction made by the line.
Squaring the distances ensures the data points above and below the line are treated the same.

The method of choosing the 'best regression line' (or fitting a line to the data) is known as ordinary least squares.





Thursday, October 16, 2014

Why did I try Residual Plot over my dataset

Why Residual Plot


Well, I previously had a sick dataset of India's population, which was correct, but I altered it and made it worse. The population increases linearly with the year and then suddenly shoots up to a pinnacle, so ideally this data was never meant for linear regression; but, bound by habit, I ran linear regression on it and found this -



Ok, above is the fitted line I got, and it is terrible; believe me, because I ran the predictions and got brilliantly bad results :(

For the years 1800, 2030 and 2040 I got

        1          2          3
-11839.78  824736.40  861109.28

So that would mean there was no India on the map in 1800 :O ... what, that's not possible, I messed it up...
Well, I did try to make the data work properly, but nothing helped.

So now I knew that I needed to transform my data into some other form, so I searched the internet and found a keyword: Residual Plot.


Well, what, again a new concept? Why should I learn this....

A residual is the error between an actual value of the dependent variable and the predicted value. So, setting aside all the mind-blowing keywords behind it, I finally concluded that it is a way to find out whether a model is a 'good fit' or not.

There are 2 very basic and easy things to remember about a residual plot -
 1. The residuals for a 'good' regression model are normally distributed, and random.

 2. The residuals for a 'bad' regression model are non-normal, and show a distinct, non-random pattern.




So from the plot above we can see a sure-shot case of bad data and a bad model, and I know for sure this model is bad, as my residuals definitely show a pattern, a superb growing pattern....


More, just in case I need more -

If I have come this far, I should complete the picture by drawing an example where the data fits the model well; let's see how the residuals look then -

Following Sample data -

x <- runif(100,-3,3)

y <- x + sin(x) + rnorm(100, sd = .2)


and I got -


Good one, isn't it? But let's not be in a hurry; let's look at the residual plot -



Now I can see a pattern, a sine wave, ahh... So the scatter plot says the model looks good, but it is not. So don't judge by the scatter plot or the fitted model alone; there may be trouble inside, and there is no harm in running the residual plot.
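For reference, a minimal Python sketch of the same experiment (numpy, statsmodels and matplotlib; synthetic data mirroring the R lines above) that produces this kind of residual plot:

 import numpy as np
 import statsmodels.api as sm
 import matplotlib.pyplot as plt

 x = np.random.uniform(-3, 3, 100)
 y = x + np.sin(x) + np.random.normal(scale=0.2, size=100)

 fit = sm.OLS(y, sm.add_constant(x)).fit()

 plt.scatter(fit.fittedvalues, fit.resid)   # look for structure: a pattern means a bad fit
 plt.axhline(0, color="red")
 plt.xlabel("fitted values")
 plt.ylabel("residuals")
 plt.show()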