Rattle Data Mining Assessment

This is a solution to a Rattle data mining assessment. It discusses the identification of the five key variables contributing most to the survival rate of Titanic passengers, a decision tree model built in the Rattle data mining tool to predict survival, and the role of data quality in ensuring a proper data warehouse architecture.

1.0 Introduction

The sinking of the Titanic is one of the most famous and memorable events in history. Many passengers were travelling on the Titanic; some of them survived while others lost their lives. This paper analyzes the survival rate of the passengers travelling on the Titanic, identifies the factors that were most important in determining their survival, and develops a predictive model to estimate a passenger's chances of survival from specific variables.

The paper uses data collected for each passenger: age, sex, pclass, number of siblings aboard, number of parents aboard, body identification number, passenger fare and so on. The Rattle data mining tool is used to analyze the data with techniques such as correlation analysis, decision trees and principal component analysis. These techniques help to identify the variables that matter most for a higher survival rate. Predictive modeling is then carried out with a decision tree, so that a model can predict whether a particular passenger would have survived under given circumstances.

2.0 Identification of five key variables contributing most to survival rate of passenger

The five key components contributing to the survival rate of Titanic passengers can be identified through principal component analysis of the collected data. Running principal component analysis in the Rattle data mining tool, together with correlation analysis, gives the output below:

Figure: principal component analysis for the Titanic3_score dataset

As the figure above shows, five principal components were identified. Together these five components explain 100% of the variance present in the data. The table below gives the standard deviation and the proportion of variance explained by each principal component.

| PC | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|
| SD | 1.190 | 1.130 | 1.019 | 0.867 | 0.717 |
| Proportion of variance | 0.283 | 0.255 | 0.207 | 0.150 | 0.102 |
| Cumulative proportion | 0.283 | 0.538 | 0.746 | 0.897 | 1.000 |

Table 1: Principal component analysis in terms of standard deviation and variance
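
The relationship between the rows of Table 1 can be checked with a short calculation. The sketch below is not Rattle output; it assumes only that the proportion of variance is each component's squared standard deviation divided by the sum of all squared standard deviations. Because the standard deviations in Table 1 are rounded to three decimals, the results agree with the table only to within about 0.001.

```python
# Standard deviations of the five principal components, copied from Table 1.
sdev = [1.190, 1.130, 1.019, 0.867, 0.717]

# The variance of each component is its squared standard deviation.
variances = [s ** 2 for s in sdev]
total_variance = sum(variances)

# Proportion of variance explained by each component, and the running total.
proportions = [v / total_variance for v in variances]
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: proportion={p:.3f} cumulative={c:.3f}")
```

The cumulative proportion reaches 1.0 at PC5 because five components were extracted from five input variables.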

As the table above shows, principal component 1 explains 28.3% of the variance in the data, principal component 2 explains 25.5%, principal component 3 explains 20.7%, principal component 4 explains 15.0% and principal component 5 explains the remaining 10.2%. To interpret each principal component, the rotation (loading) of each variable on each component is given below:

| Factor | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|
| Age | -0.455 | 0.180 | -0.579 | 0.641 | -0.115 |
| Sibsp | -0.572 | 0.222 | 0.532 | 0.115 | 0.570 |
| Parch | -0.051 | 0.761 | 0.268 | -0.112 | -0.576 |
| Fare | -0.555 | -0.037 | -0.383 | -0.735 | -0.034 |
| Body | 0.392 | 0.579 | -0.401 | -0.143 | 0.571 |

Table 2: Rotation (loadings) of each variable on the principal components identified
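
One way to read Table 2 is to look for the variable with the largest absolute loading on each component. The sketch below does this with the values copied from the table; the "largest absolute loading" rule is an assumption made here for illustration, not a procedure stated in the source. Note that on PC5 the loadings of Sibsp, Parch and Body (0.570, -0.576 and 0.571) are almost tied, so the pick for that component is not clear-cut.

```python
# Loadings copied from Table 2: one entry per variable, one value per component.
components = ["PC1", "PC2", "PC3", "PC4", "PC5"]
loadings = {
    "Age":   [-0.455, 0.180, -0.579, 0.641, -0.115],
    "Sibsp": [-0.572, 0.222, 0.532, 0.115, 0.570],
    "Parch": [-0.051, 0.761, 0.268, -0.112, -0.576],
    "Fare":  [-0.555, -0.037, -0.383, -0.735, -0.034],
    "Body":  [0.392, 0.579, -0.401, -0.143, 0.571],
}

# For each component, pick the variable whose loading is largest in absolute value.
dominant = {
    pc: max(loadings, key=lambda var: abs(loadings[var][j]))
    for j, pc in enumerate(components)
}

for pc in components:
    print(pc, dominant[pc])
```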

Figure: scatter plot for principal component analysis

The table above gives the rotation of each variable on each of the five principal components. Sibsp, the number of siblings/spouses aboard, has the largest absolute loading on PC1 (-0.572) and is therefore treated as the most important variable for a passenger's probability of survival. It is followed by Parch, the number of parents/children aboard (0.761 on PC2), age (-0.579 on PC3), passenger fare (-0.735 on PC4) and the body identification number of the passenger (0.571 on PC5). The justification for each variable is given below:

  • Age: A passenger's age was a crucial factor in survival. The older the passenger, the lower the chances of survival, so the two variables are negatively related.
  • Number of siblings aboard: The more siblings a passenger had aboard, the lower the passenger's chances of survival are expected to be, as the passenger would also be occupied with protecting his or her siblings.
  • Number of parents/children aboard: Similarly, the more parents or children a passenger had aboard, the lower the passenger's chances of survival are expected to be.
  • Passenger fare: A passenger paying a higher fare had a higher chance of survival, so the two variables are positively correlated.
  • Body identification number: This number is recorded only for passengers whose bodies were recovered, so it has no direct correlation with a passenger's chances of survival. Table 3 accordingly reports the correlation between body and survival as NA.

The five variables identified above are also related to one another, as the correlation analysis below shows:

Figure: correlation among given variables

| Factor | Age | Sibsp | Parch | Fare | Survived | Body |
|---|---|---|---|---|---|---|
| Age | 1.000 | -0.267 | -0.147 | 0.199 | -0.052 | 0.125 |
| Sibsp | -0.267 | 1.000 | 0.360 | 0.151 | -0.016 | -0.129 |
| Parch | -0.147 | 0.360 | 1.000 | 0.176 | 0.078 | 0.067 |
| Fare | 0.199 | 0.151 | 0.176 | 1.000 | 0.233 | -0.044 |
| Survived | -0.052 | -0.016 | 0.078 | 0.233 | 1.000 | NA |
| Body | 0.125 | -0.129 | 0.067 | -0.044 | NA | 1.000 |

Table 3: Correlation coefficients between each pair of variables
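
To see at a glance which variables are most strongly related to survival, the Survived column of Table 3 can be ranked by absolute correlation. This is only a reading aid built from the table values, not part of the Rattle output; Body is left out because its correlation with Survived is reported as NA.

```python
# Correlation of each variable with Survived, copied from Table 3.
# Body is omitted because Table 3 reports its correlation with Survived as NA.
corr_with_survived = {
    "Age": -0.052,
    "Sibsp": -0.016,
    "Parch": 0.078,
    "Fare": 0.233,
}

# Rank variables from strongest to weakest relationship, ignoring the sign.
ranked = sorted(corr_with_survived.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, r in ranked:
    direction = "positive" if r > 0 else "negative"
    print(f"{name}: r={r:+.3f} ({direction})")
```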

As the figure above shows, the variables describing the sinking of the Titanic are correlated with one another to varying degrees. Strong colors indicate a strong correlation, while light colors indicate a weaker correlation. Positive correlations are shown as blue dots and negative correlations as orange dots. The strongest positive correlation is between the number of parents/children aboard and the number of siblings aboard (0.360), followed by the correlation between passenger fare and survival (0.233).

The orange dots show that the strongest negative correlation is between the number of siblings and the age of the passenger (-0.267). Weaker negative correlations exist between survival and age (-0.052), between body and fare (-0.044), and between the number of parents/children aboard and age (-0.147). The remaining pairs of variables in the table show comparatively weak positive or negative correlation, which is of little statistical significance for the analysis.

To investigate the correlation structure further, the figure below shows the variable correlation clusters for the variables in the task.

Figure: correlation cluster for the given variables

In the figure above, age and the survival rate of the passenger are grouped most closely, while the body identification number and the survival rate of the passenger show the weakest relationship.

3.0 Building a model for predicting survival of passengers on the Titanic using a decision tree model in the Rattle data mining tool

Several factors are important in predicting the survival of passengers aboard the Titanic, and these are used to develop the predictive model. As the decision tree below shows, four variables mainly determine a passenger's probability of survival: the fare paid by the passenger, the number of siblings aboard, the number of parents/children aboard and the port of embarkation.

Figure: decision tree for determining survival rate of passengers

Each of these four variables is split into sub-categories that are of particular importance. The first variable, passenger fare, is split at 25.467 into fares greater than 25.467 and fares less than or equal to 25.467. Passengers paying the higher fare have a higher probability of survival than passengers paying the lower fare (Weiss and Indurkhya, 1998).

The second variable, Sibsp, is the number of siblings a passenger had aboard the Titanic. Sibsp is inversely related to survival: passengers with more siblings aboard had lower chances of survival, while passengers with fewer siblings aboard had a higher probability of survival. The tree splits this variable at 2 siblings: passengers with more than 2 siblings aboard had lower chances of survival, while passengers with fewer siblings aboard had higher chances.

The third variable is Parch, the number of parents or children a passenger had aboard. The analysis treats Parch as inversely related to the survival rate of the passengers. The tree splits this variable at 1, and the analysis associates a higher number of parents/children aboard with lower chances of survival (Theodoridis and Koutroumbas, 2009).

The fourth significant variable is the port of embarkation, as survival rates differ between ports. Some ports are associated with higher survival, while others appear riskier. Three ports are considered in the analysis: Cherbourg, Queenstown and Southampton. Passengers who embarked at Southampton have a higher probability of survival than passengers who embarked at Cherbourg or Queenstown.

The decision tree built for the predictive model has led to a set of leaf nodes, each of which decides a passenger's chances of survival from the important factors. As the figure above shows, the analysis produces five leaf nodes, numbered 3, 4, 7, 8 and 9. Each node gives a different prediction of a passenger's survival, and each is formed by a combination of the variables that determine the survival rate.

For each node, the probability of passenger survival is determined by the underlying variables. The descriptions below give the analysis for each node, with the prediction regarding the survival of its passengers:

Node 8: Node 8 predicts the highest chances of survival. It contains 15 sample records, which the analysis treats as survived. Node 8 represents passengers who paid a fare greater than 25.467, had at most 2 siblings aboard and had more than 1 parent/child aboard. According to the predictive model, survival chances for these passengers are the highest among all passengers.

Node 4: Node 4 has the second highest probability of passenger survival. It represents a set of 100 passengers who paid a fare less than or equal to 25.467 and embarked at Southampton.

Node 7: Node 7 represents a set of 65 passengers with a fare greater than 25.467, at most 2 siblings aboard and at most 1 parent/child aboard. After Node 8 and Node 4, passengers in Node 7 have the highest chances of survival.

Node 3 & 9: Node 3 and Node 9 contain 40 and 8 passengers respectively. Node 3 contains passengers with a fare less than or equal to 25.467 and port of embarkation C or Q, while Node 9 contains passengers with a fare greater than 25.467 and more than 2 siblings aboard.
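
The node descriptions above can be written as a small set of nested rules. The sketch below is a hand-written illustration of those splits, not the model object produced by Rattle. The function name `assign_node`, the port codes "C", "Q" and "S", and the handling of values that fall exactly on a split (for example exactly 2 siblings or exactly 1 parent/child) are assumptions made here.

```python
FARE_SPLIT = 25.467  # fare threshold quoted in the text


def assign_node(fare, embarked, sibsp, parch):
    """Return the leaf node number (3, 4, 7, 8 or 9) for one passenger.

    Assumes embarked is coded "C", "Q" or "S" and that boundary values
    go to the "less than or equal" branch of each split.
    """
    if fare <= FARE_SPLIT:
        # Low-fare branch: split on port of embarkation.
        return 4 if embarked == "S" else 3
    if sibsp > 2:
        # High fare with more than 2 siblings aboard.
        return 9
    # High fare with at most 2 siblings: split on parents/children aboard.
    return 8 if parch > 1 else 7


# Example passengers (hypothetical values chosen to reach each leaf).
print(assign_node(fare=10.0, embarked="S", sibsp=0, parch=0))
print(assign_node(fare=50.0, embarked="C", sibsp=1, parch=2))
print(assign_node(fare=50.0, embarked="S", sibsp=3, parch=0))
```

Writing the tree this way makes it easy to see that every passenger falls into exactly one of the five leaf nodes.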

4.0 Role of quality of data in order to ensure proper data warehouse architecture

Data warehouse architecture helps decision makers to take better and more timely decisions. The aim of a proper data warehouse architecture is to provide the right information to the business at the right place and the right time, cost-effectively, so that better decision making can take place. Quality of information is very important for the data warehouse architecture, because it allows accurate decisions to be made faster than with a data warehouse that contains lower quality information (Murphy, 2011). Quality of information matters because information is the basis on which decisions are taken.

Data quality in a data warehouse architecture can be described through five key variables: accessibility, interpretability, usefulness, believability and validation. Accessibility relates to the data sources, the data warehouse design and the processes used in the data warehouse. Because data accessibility is one of the prime features of any data warehouse architecture, data quality should be improved so that a proper data design can be maintained to organize the data (Nisbet et al., 2009).

Interpretability relates to the design of the data warehouse, the models and languages used to interpret data, the processing of queries and the data warehousing processes used to analyze the collected data. To develop a data warehouse architecture that gives the decision maker a high level of interpretability, it is essential to feed high quality information into the data warehouse. High quality input is also essential for managing the models and languages used in the data warehouse (Poncelet et al., 2007).

Usefulness relates to update policies, data warehouse evolution, data sources, data warehouse design and the processes that keep the data useful. For a data warehouse architecture to be useful to the people who rely on it, data quality should be ensured so that the architecture can be properly managed and decision making can take place faster and more accurately.

Believability relates to the reliability of the data, so that the results output by the warehouse architecture can be trusted and used for analysis. Finally, validation means checking the data contained in the warehouse structure against a desired set of data. Validation can be ensured only through high quality data (Tan et al., 2005).

5.0 Analysis through pivot table

5.1 Creating report with store values in each quarter of 1998 across country and states and comment on the critical trends observed.

The graph and table below show the key trends in store values for each quarter of 1998 across the three countries. Quarters 1 to 3 were broadly similar on all three store measures: store cost, store sales and unit sales. Quarter 3 had the highest overall totals, although the peak quarter differed by country: by store sales, the USA peaked in quarter 1, Canada in quarter 2 and Mexico in quarter 3. Quarter 4 had the lowest values of all store measures for all three countries.

Figure: store values for state/provinces for all quarters

 

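| Period | Country | Store Cost | Store Sales | Unit Sales |
|---|---|---|---|---|
| 1998 | All | 432565.7289 | 1079147.47 | 509987 |
| Quarter 1 | All | 116512.6905 | 290873.18 | 137078 |
| Quarter 1 | Canada | 9576.6446 | 23881.13 | 11160 |
| Quarter 1 | Mexico | 47502.2264 | 118589.41 | 56133 |
| Quarter 1 | USA | 59433.8195 | 148402.64 | 69785 |
| Quarter 2 | All | 115080.3318 | 287009.99 | 135745 |
| Quarter 2 | Canada | 11072.1808 | 27685 | 12885 |
| Quarter 2 | Mexico | 45683.9482 | 113830.59 | 54005 |
| Quarter 2 | USA | 58324.2028 | 145494.4 | 68855 |
| Quarter 3 | All | 118322.14 | 295040.55 | 139412 |
| Quarter 3 | Canada | 10915.5866 | 27176.3 | 12966 |
| Quarter 3 | Mexico | 49267.9496 | 122706.05 | 57872 |
| Quarter 3 | USA | 58138.6038 | 145158.2 | 68574 |
| Quarter 4 | All | 82650.5666 | 206223.75 | 97752 |
| Quarter 4 | Canada | 7768.1585 | 19303.03 | 9146 |
| Quarter 4 | Mexico | 30133.9206 | 75167.54 | 35904 |
| Quarter 4 | USA | 44748.4875 | 111753.18 | 52702 |
| Grand Total | All | 432565.7289 | 1079147.47 | 509987 |

Table 4: Store values for state/provinces for all quarters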
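
The quarterly trends can be checked directly from the Store Sales column of Table 4. The sketch below is a plain Python stand-in for the pivot table, not the spreadsheet itself; the short quarter labels "Q1" to "Q4" are introduced here for convenience.

```python
# Store sales by country and quarter, copied from Table 4.
store_sales = {
    "Canada": {"Q1": 23881.13, "Q2": 27685.00, "Q3": 27176.30, "Q4": 19303.03},
    "Mexico": {"Q1": 118589.41, "Q2": 113830.59, "Q3": 122706.05, "Q4": 75167.54},
    "USA": {"Q1": 148402.64, "Q2": 145494.40, "Q3": 145158.20, "Q4": 111753.18},
}
quarters = ["Q1", "Q2", "Q3", "Q4"]

# Total store sales per quarter across the three countries.
quarter_totals = {q: sum(store_sales[c][q] for c in store_sales) for q in quarters}

# Best and worst quarter for each country.
best_quarter = {c: max(sales, key=sales.get) for c, sales in store_sales.items()}
worst_quarter = {c: min(sales, key=sales.get) for c, sales in store_sales.items()}

for q in quarters:
    print(q, round(quarter_totals[q], 2))
print("best:", best_quarter)
print("worst:", worst_quarter)
```

On these figures the best quarter differs by country, while quarter 4 is the weakest quarter for every country.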

5.2 Creating report with country and state province in each product category and unit sales sub-product category of breakfast food and comment on key trends observed.

The table below gives unit sales of breakfast foods by product sub-category across the three countries. Overall, Cereal has the highest unit sales (6640), closely followed by Waffles (6628), then Pancake Mix (1657) and Pancakes (1624). Among the three countries, the USA has the highest consumption of breakfast foods, followed by Mexico, with Canada the lowest. Within the USA, cereal is the highest selling sub-category, followed by waffles, pancakes and pancake mix, whereas in Canada and Mexico waffles sell slightly more than cereal. Across all sub-categories and countries, cereal in the USA is the most consumed combination (3499 units).

Unit Sales

| Product category / sub-category | Canada | Mexico | USA | Grand Total |
|---|---|---|---|---|
| Breakfast Foods | 1453 | 6594 | 8502 | 16549 |
| Cereal | 556 | 2585 | 3499 | 6640 |
| Pancake Mix | 112 | 701 | 844 | 1657 |
| Pancakes | 156 | 603 | 865 | 1624 |
| Waffles | 629 | 2705 | 3294 | 6628 |
| Grand Total | 1453 | 6594 | 8502 | 16549 |

Table 5: Product and subcategory distribution for breakfast
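
The ranking of sub-categories, overall and within each country, can be reproduced from Table 5 with a few lines of code. This is a sketch using the table values only; the variable names are chosen here for illustration.

```python
# Unit sales of breakfast food sub-categories by country, copied from Table 5.
unit_sales = {
    "Cereal": {"Canada": 556, "Mexico": 2585, "USA": 3499},
    "Pancake Mix": {"Canada": 112, "Mexico": 701, "USA": 844},
    "Pancakes": {"Canada": 156, "Mexico": 603, "USA": 865},
    "Waffles": {"Canada": 629, "Mexico": 2705, "USA": 3294},
}
countries = ["Canada", "Mexico", "USA"]

# Overall ranking of sub-categories by total unit sales.
totals = {sub: sum(by_country.values()) for sub, by_country in unit_sales.items()}
overall_rank = sorted(totals, key=totals.get, reverse=True)

# Top-selling sub-category within each country.
top_by_country = {
    c: max(unit_sales, key=lambda sub: unit_sales[sub][c]) for c in countries
}

print("overall:", overall_rank)
print("top by country:", top_by_country)
```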

Figure: product and subcategory distribution for breakfast

5.3 Creating report with state of Oregon & Washington, total sales & total sales value and comment on key trend observed.

The analysis below gives the unit sales and store sales values for two states in the United States of America, Oregon and Washington. As the table below shows, unit sales for Washington are slightly more than double the unit sales for Oregon. The same pattern holds for store sales, where the total sales value for Washington is slightly more than double that of Oregon. Consumption in Washington is therefore about twice that in Oregon, which might be due to its higher population.

| Country / state | Unit Sales | Store Sales |
|---|---|---|
| USA | 186899 | 396294.93 |
| OR | 60612 | 128598.5 |
| WA | 126287 | 267696.43 |
| Grand Total | 186899 | 396294.93 |
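
The "more than double" comparison can be made concrete by computing the Washington-to-Oregon ratios from the table. The sketch below uses only the table values; the store sales per unit figure is an extra derived quantity added here for illustration and is not reported in the source.

```python
# Unit sales and store sales for Oregon (OR) and Washington (WA), from the table.
sales = {
    "OR": {"unit_sales": 60612, "store_sales": 128598.50},
    "WA": {"unit_sales": 126287, "store_sales": 267696.43},
}

# Ratio of Washington to Oregon for each measure.
unit_ratio = sales["WA"]["unit_sales"] / sales["OR"]["unit_sales"]
value_ratio = sales["WA"]["store_sales"] / sales["OR"]["store_sales"]

# Store sales per unit sold in each state (derived here, not in the source).
per_unit = {s: v["store_sales"] / v["unit_sales"] for s, v in sales.items()}

print(f"unit sales ratio WA/OR: {unit_ratio:.2f}")
print(f"store sales ratio WA/OR: {value_ratio:.2f}")
print({s: round(p, 2) for s, p in per_unit.items()})
```

Both ratios come out just above 2, and the sales value per unit is almost the same in the two states, so the difference between them is one of volume rather than price.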

   

5.4 Creating report with product categories of beer and wine and their product sub categories by order of unit sales and comment on key trend observed.

The table below gives the unit sales for the two product categories, beer and wine, along with their sub-categories. Wine has much higher unit sales than beer: 9710 units against 3359, which is more than twice as many. Within the beer category, Good is the highest selling brand while Top Measure is the lowest selling brand. Within the wine category, Good is again the highest selling brand while Walrus is the lowest selling brand.

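| Product category | Brand | Unit Sales |
|---|---|---|
| Beer and Wine | All | 13069 |
| Beer | All | 3359 |
| Beer | Good | 767 |
| Beer | Pearl | 725 |
| Beer | Portsmouth | 713 |
| Beer | Top Measure | 546 |
| Beer | Walrus | 608 |
| Wine | All | 9710 |
| Wine | Good | 2097 |
| Wine | Pearl | 2028 |
| Wine | Portsmouth | 1942 |
| Wine | Top Measure | 1883 |
| Wine | Walrus | 1760 |
| Grand Total | All | 13069 |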

6.0 Conclusion

This research work analyzed the survival rate of passengers aboard the Titanic using the collected data. More than 2200 people were on the ship when it collided with an iceberg. More than 1400 people died in the disaster and only about 700 were saved. Data mining was used to assess the probability of survival for a passenger aboard the Titanic. Many variables affected a passenger's probability of survival, such as ticket category, age, sex and identification status. The analysis was carried out in the Rattle data mining tool with specific techniques: principal component analysis to determine which factors are most important, correlation analysis to measure the correlation between pairs of variables, and a decision tree to build a predictive model for the passengers of the Titanic. The principal component analysis pointed to five variables of greatest importance: number of siblings aboard, number of parents/children aboard, age, passenger fare and body identification number. The decision tree relied on passenger fare, number of siblings aboard, number of parents/children aboard and port of embarkation.

References

  • Murphy, C. (2011). "Is Data Mining Free Speech?". InformationWeek (UMB): 12.
  • Nisbet, R.; Elder, J.; and Miner, G. (2009). Handbook of Statistical Analysis and Data Mining Applications. Academic Press/Elsevier. ISBN 978-0-12-374765-5.
  • Poncelet, P.; Masseglia, F.; and Teisseire, M. (2007). Data Mining Patterns: New Methods and Applications. Information Science Reference. ISBN 978-1-59904-162-9.
  • Tan, P.; Steinbach, M.; and Kumar, V. (2005). Introduction to Data Mining. ISBN 0-321-32136-7.
  • Theodoridis, S.; and Koutroumbas, K. (2009). Pattern Recognition, 4th Edition. Academic Press. ISBN 978-1-59749-272-0.
  • Weiss, S.; and Indurkhya, N. (1998). Predictive Data Mining. Morgan Kaufmann.