CIS8008 Business Intelligence Rattle Data Mining Assignment

1.0    Introduction

The Titanic struck an iceberg late on the night of 14th April 1912 with over 2200 people on board, and the sinking resulted in the death of about 1500 people. The ship, which had been considered unsinkable and carried too few lifeboats for everyone, took 2 hours and forty minutes to sink. It was one of the major disasters of the 20th century and was greeted with global outrage over the operational and regulatory failures involved. In this Rattle data mining assignment, we analyse the survival rate of the passengers on board and build a predictive model that estimates a passenger's chance of survival (Nisbet, 2006). The predictive model is based on the variables of significance, that is, the variables most closely associated with survival. The data is analysed using decision tree analysis, correlation analysis and principal component analysis. The data used to determine the chances of survival is the passengers' demographic information, such as age, sex, class of travel, number of family members on board and passenger fare. The main objective is to identify the variables that influenced survival and then, through decision tree analysis, to build a predictive model of the possibility of survival in the conditions of that cold night (Liu, 2007).

2.0    Identifying Variables of Significance

Principal component analysis is used to determine the variables of significance. These are the variables most strongly associated with the survival chances of the passengers on board the Titanic, and they are identified from the passenger data. The five principal components together explain 99.7 %, or nearly 100 %, of the variance in the data. The figure below shows the standard deviation and the variance for each of the five components.

Figure: Standard deviation and variance of the five components

The figure above is based on the data below. PC1 explains 0.283, or 28.3 %, of the variance in the data, and PC2, the next most significant component, explains 25.5 %. Together, the five principal components explain 100 % of the variance in the data.

Principal Component       PC1      PC2      PC3      PC4      PC5
Standard Deviation        1.19     1.13     1.019    0.867    0.717
Proportion of Variance    0.283    0.255    0.207    0.15     0.102
Cumulative Proportion     0.283    0.538    0.746    0.897    1
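The link between the standard deviation row and the variance rows of the table above can be checked with a short calculation. The sketch below assumes that each component's share of variance is its squared standard deviation divided by the sum of all squared standard deviations; the variable names are illustrative and not taken from Rattle. Because the published standard deviations are rounded, the results agree with the table only to about the third decimal place.

```python
from itertools import accumulate

# Standard deviations of PC1..PC5, copied from the table above.
std_devs = [1.19, 1.13, 1.019, 0.867, 0.717]

# Variance of each component is the square of its standard deviation.
variances = [sd ** 2 for sd in std_devs]
total_variance = sum(variances)

# Share of the total variance explained by each component.
proportions = [v / total_variance for v in variances]

# Running total of the shares (the cumulative proportion row).
cumulative = list(accumulate(proportions))

for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: proportion={p:.3f} cumulative={c:.3f}")
```

The total of the squared standard deviations comes out close to 5, which is what would be expected if the five input variables were standardised before the analysis.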

The rotation (loading) of each variable on the five principal components is given below, and a scatter plot can be drawn for the principal component analysis.

Factor    PC1       PC2       PC3       PC4       PC5
Age       -0.455    0.18      -0.579    0.641     -0.115
Sibsp     -0.572    0.222     0.532     0.115     0.57
Parch     -0.051    0.761     0.268     -0.112    -0.576
Fare      -0.555    -0.037    -0.383    -0.735    -0.034
Body      0.392     0.579     -0.401    -0.143    0.571
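One way to read the rotation table above is to ask which variable has the largest absolute loading on each component. The sketch below does this with the values from the table; the dictionary layout and names are assumptions made for illustration.

```python
# Loadings of each variable on PC1..PC5, copied from the table above.
loadings = {
    "Age":   [-0.455, 0.18, -0.579, 0.641, -0.115],
    "Sibsp": [-0.572, 0.222, 0.532, 0.115, 0.57],
    "Parch": [-0.051, 0.761, 0.268, -0.112, -0.576],
    "Fare":  [-0.555, -0.037, -0.383, -0.735, -0.034],
    "Body":  [0.392, 0.579, -0.401, -0.143, 0.571],
}
components = ["PC1", "PC2", "PC3", "PC4", "PC5"]

# For each component, pick the variable with the largest absolute loading.
dominant = {}
for i, pc in enumerate(components):
    dominant[pc] = max(loadings, key=lambda var: abs(loadings[var][i]))

for pc in components:
    print(pc, dominant[pc])
```

On PC5 the loadings of Parch, Body and Sibsp are almost equal in size, so the dominant variable there should not be over-interpreted.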

Figure: Rotation of the variables on the five principal components

The five significant variables are Age, Sibsp, Parch, Fare and Body Identification Number, and the rotation of each on the five principal components is shown above.

Significant Variables

Age: This variable is the age of the passenger on board the Titanic. Survival chances decreased slightly as age increased, so younger passengers had a somewhat higher survival rate.

Fare: This is the fare paid by the passenger to board the Titanic. Fares differ by class, and the higher the fare, the higher the chances of survival.

Body Identification Number: The charts above show no direct correlation between the survival rate and the Body Identification Number. A body identification number is normally recorded only for victims whose bodies were recovered, which would explain why no correlation with survival could be established.

Sibsp: It is the number of siblings or spouses the passenger had on board, and it is treated here as the most significant variable. As the number of siblings or spouses on board increased, the chances of survival decreased; the lower the value of Sibsp, the higher the chances that the passenger would survive.

Parch: It is the number of parents or children the passenger had on board. The higher the value of this variable, the lower the chances of survival. It is also an important variable for establishing the chances of survival on the Titanic.

The table below shows the correlation coefficients between each pair of variables. The five variables are correlated with one another to varying degrees. There is a positive correlation between Parch and Sibsp (0.36). There are negative correlations between Age and Sibsp (-0.267), Parch and Age (-0.147) and Body and Fare (-0.044). There is a clear positive correlation between Fare and Survived (0.233); hence it can be said that passengers who paid a higher fare had a higher survival rate. No correlation could be computed between Body Identification Number and Survived, which is shown in the table as NA.

Factor      Age       Sibsp     Parch     Fare      Survived    Body
Age         1         -0.267    -0.147    0.199     -0.052      0.125
Sibsp       -0.267    1         0.36      0.151     -0.016      -0.129
Parch       -0.147    0.36      1         0.176     0.078       0.067
Fare        0.199     0.151     0.176     1         0.233       -0.044
Survived    -0.052    -0.016    0.078     0.233     1           NA
Body        0.125     -0.129    0.067     -0.044    NA          1
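To see at a glance which variable is most strongly related to survival, the Survived row of the table above can be ranked by the absolute size of each coefficient. This is a minimal sketch using the published values; the NA entry for Body is represented as None and left out of the ranking, and the names are illustrative.

```python
# Correlation of each variable with Survived, copied from the table above.
# Body is NA in the table, so it is stored as None.
survived_corr = {
    "Age": -0.052,
    "Sibsp": -0.016,
    "Parch": 0.078,
    "Fare": 0.233,
    "Body": None,
}

# Drop variables whose correlation could not be computed.
available = {name: r for name, r in survived_corr.items() if r is not None}

# Rank by strength of association, ignoring the sign.
ranked = sorted(available, key=lambda name: abs(available[name]), reverse=True)

for name in ranked:
    print(f"{name}: {available[name]:+.3f}")
```

The ranking puts Fare first by a wide margin, with Parch, Age and Sibsp all close to zero.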

The figure below shows the correlation cluster of the variables Age, Parch, Sibsp, Fare and Body with the survival rate. It shows a significant correlation between Fare and survival chances, which is also the strongest correlation with Survived in the table above. Age, Parch and Sibsp show only a small degree of correlation with the survival chances, and no correlation is established for Body.

Figure: Variable correlation cluster

3.0    Developing Predictive Model for Passenger Survival

A predictive model is developed to predict the survival of people on board the Titanic. A decision tree identifies four significant variables, which are used in the Rattle data mining model to predict the survival of the people on board.

The decision tree uses four variables to predict the survival of passengers. The first variable is the passenger Fare, which is split into two categories, one above 25.467 and the other below 25.467. The higher the fare paid to board the Titanic, the higher the chances of survival; the fare depends on the class in which the passenger travelled. The number of siblings or spouses on board (Sibsp) is another significant variable found by the decision tree analysis: the higher the number, the lower the chances of survival. Sibsp is split into two categories, one greater than 2 and the other less than 2. Passengers with a Sibsp value greater than 2 have lower chances of survival, and those with a value less than 2 have higher chances. Hence, Sibsp is also a significant variable in determining the survival rate, alongside Fare (Resig and Teredesai, 2004).

The third variable is the number of parents or children on board (Parch): the lower the value, the higher the chances of survival of the passenger, so it has an inverse relationship with the survival rate. The Parch variable is split at one into two categories, one greater than one and the other less than one. Lastly, the fourth variable in the decision tree analysis is the port of embarkation, since the chances of survival vary from one port to another. The three ports considered are Cherbourg, Queenstown and Southampton, and their effect on the survival rate is examined in the decision tree analysis.

According to the decision tree, passengers who embarked at Southampton have higher chances of survival than those who embarked at the other two ports, Queenstown and Cherbourg. Once the variables have been identified, the decision tree is used to build the model for predicting the survival rate. Five nodes are formed, 3, 4, 7, 8 and 9, each with a different predicted survival rate for the passengers on the Titanic. We analyse each of these nodes to determine the probability of survival for the passengers it covers (Paul et al, 2011).

Analysing Different Nodes 

Node 8: Node 8 represents passengers with a Fare greater than 25.467, a Sibsp value less than 2 and a Parch value greater than one. The survival rate of the passengers on the Titanic is highest for node 8 compared with all other nodes. The node is very significant because it has the highest chances of survival, and 15 passengers in the sample meet its criteria.

Node 4: After Node 8, Node 4 is the second most significant node, with the second highest probability of survival. This node represents the passengers who paid a Fare of less than 25.467 and embarked at Southampton. This shows the importance of the port of Southampton: passengers who embarked there have higher chances of survival than those from the other ports.

Node 7: Node 7 is the third most significant node in terms of the survival chances of the passengers on the Titanic. It represents passengers with a Fare greater than 25.467, a Sibsp value less than or equal to two and a Parch value less than or equal to one.

Node 3: Node 3 is formed with n = 40 passengers who paid a Fare less than or equal to 25.467 and embarked at Cherbourg or Queenstown. It is a less significant node, with lower chances of survival than Nodes 8, 4 and 7.

Node 9: Node 9 is formed with passengers who paid a Fare greater than 25.467 and have a Sibsp value greater than 2. The node has n = 8 passengers and lower chances of survival than the other nodes.
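The node descriptions above can be written as a small rule function that assigns a passenger to one of the five nodes. This is a sketch of the rules as described in the text, not the tree object produced by Rattle. The function name, the use of full port names and the handling of values exactly on a split (Fare of 25.467, Sibsp of 2, Parch of 1 go to the lower branch) are assumptions.

```python
FARE_SPLIT = 25.467  # fare threshold reported in the text
PORTS = {"Cherbourg", "Queenstown", "Southampton"}


def assign_node(fare, sibsp, parch, embarked):
    """Return the node number (3, 4, 7, 8 or 9) for one passenger."""
    if embarked not in PORTS:
        raise ValueError(f"unknown port of embarkation: {embarked!r}")
    if fare <= FARE_SPLIT:
        # Lower fares are split by port of embarkation.
        return 4 if embarked == "Southampton" else 3
    if sibsp > 2:
        # Higher fare with more than two siblings or spouses on board.
        return 9
    # Higher fare, Sibsp of two or fewer: split on Parch at one.
    return 8 if parch > 1 else 7


# Example: a passenger with a high fare, one sibling and two parents/children.
print(assign_node(fare=30.0, sibsp=1, parch=2, embarked="Southampton"))
```

Writing the rules this way makes the order of the splits explicit: Fare is tested first, then either the port (lower fares) or Sibsp and Parch (higher fares).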

4.0    Significance of Quality of Data

The right type of information and good data quality are both necessary to build a robust solution to a problem. The data must represent the real-world construct consistently, and a Data Warehouse Architecture is used to ensure the highest quality of data. Building a solution around high-quality data helps in effective decision making (Phillips et al, 2006). Decision making follows directly from the quality of the data, and hence of the information, so it is imperative to understand data quality. Data Warehouse Architecture helps in determining a few significant features of data quality: accessibility, usefulness, interpretability, believability and validation. Accessibility deals with the design of the data and the data source, and good data quality is needed to achieve a proper data design (Wendel et al, 2002). It is also the most significant feature of the Data Warehouse Architecture, since improving data quality leads to better data design. The second feature, usefulness, concerns how well the data, kept current through updated policies, supports quick and accurate decisions; the processes and the architecture should be designed accordingly.

The next feature is interpretability, which deals with the interpretation of the data passed as input to the data warehouse architecture (Guo et al, 2009). The data is analysed and interpreted as part of decision making. This feature depends on the models and languages used to analyse the data, and accurate and effective analysis requires input data of the highest quality. The reliability of the data is another key feature, termed believability: it signifies that the output data from the Data Warehouse Architecture can be relied upon and taken forward for data analysis, and finally for helping users make decisions. Lastly, data validation built into the warehouse structure is also necessary for the quality of the data and for analysing it (Talia & Trunfio, 2010).

5.0    Analysis Through Pivot Table

5.1    Creating report with store values in each quarter of 1998 across country and states and comment on the critical trends observed.

Figure: Store Cost, Store Sales and Unit Sales in each quarter of 1998 across countries

The graph above shows Store Sales, Store Cost and Unit Sales consolidated by country and broken down by quarter of 1998. When the figures for the three countries, Canada, Mexico and USA, are combined, Quarter 3 performed best in 1998 on all three measures, with Store Sales of 295040.55, Store Cost of 118322.14 and Unit Sales of 139412. Quarter 1 was second, with Store Cost and Unit Sales of 116512.7 and 137078 respectively. Store Sales were lowest in Quarter 4 of 1998. The table below gives the Store Cost, Store Sales and Unit Sales in tabular form.

1998                Store Cost      Store Sales    Unit Sales
Canada              9576.6446       23881.13       11160
Mexico              47502.2264      118589.41      56133
USA                 59433.8195      148402.64      69785
Quarter 1 (1998)    116512.6905     290873.18      137078
Canada              11072.1808      27685          12885
Mexico              45683.9482      113830.59      54005
USA                 58324.2028      145494.4       68855
Quarter 2 (1998)    115080.3318     287009.99      135745
Canada              10915.5866      27176.3        12966
Mexico              49267.9496      122706.05      57872
USA                 58138.6038      145158.2       68574
Quarter 3 (1998)    118322.14       295040.55      139412
Canada              7768.1585       19303.03       9146
Mexico              30133.9206      75167.54       35904
USA                 44748.4875      111753.18      52702
Quarter 4 (1998)    82650.5666      206223.75      97752
Grand Total         432565.7289     1079147.47     509987
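The subtotals in the pivot table above are simple sums over the country rows, which can be reproduced in a few lines. The sketch below uses only the Unit Sales column and plain Python rather than a spreadsheet pivot table; the dictionary layout and names are illustrative assumptions.

```python
# Unit Sales by quarter and country, copied from the table above.
unit_sales = {
    "Quarter 1": {"Canada": 11160, "Mexico": 56133, "USA": 69785},
    "Quarter 2": {"Canada": 12885, "Mexico": 54005, "USA": 68855},
    "Quarter 3": {"Canada": 12966, "Mexico": 57872, "USA": 68574},
    "Quarter 4": {"Canada": 9146, "Mexico": 35904, "USA": 52702},
}

# Subtotal for each quarter across the three countries.
quarter_totals = {q: sum(by_country.values()) for q, by_country in unit_sales.items()}

# Grand total for the year and the best-performing quarter.
grand_total = sum(quarter_totals.values())
best_quarter = max(quarter_totals, key=quarter_totals.get)

for quarter, total in quarter_totals.items():
    print(quarter, total)
print("Grand Total", grand_total)
print("Highest Unit Sales:", best_quarter)
```

The same pattern applies to the Store Cost and Store Sales columns; only the numbers in the dictionary change.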

5.2 Creating report with country and state province in each product category and unit sales sub-product category of breakfast food and comment on key trends observed.

The graph below shows the unit sales of breakfast foods in the three countries, namely Canada, Mexico and USA. Breakfast foods have their highest unit sales in USA, more than Mexico and Canada combined. The graph also shows the demand in USA for each of the breakfast items, namely Pancake Mix, Cereal, Waffles and Pancakes. The unit sales of breakfast foods in USA total 8502, which is about 51 % of all breakfast food unit sales, followed by 40 % in Mexico and 9 % in Canada. Among the different breakfast foods, Cereal and Waffles sell the most, particularly in USA.

Figure: Unit sales of the different breakfast foods by country

The pie graph below shows the total Breakfast Foods unit sales in Canada, Mexico and USA. As discussed, USA accounts for 51 % of unit sales, which is more than the other two countries combined.

Figure: Pie graph of Breakfast Foods unit sales by country

5.3 Creating report with state of Oregon & Washington, total sales & total sales value and comment on key trend observed.

The figure below shows the unit sales and store sales in the states of Oregon and Washington in the United States of America. Both unit sales and store sales in Washington are slightly more than double those in Oregon, from which it can be inferred that demand in Washington is roughly twice that of Oregon, with consumption in proportion to demand. One possible reason is that Washington has a larger population than Oregon, and a larger population brings higher demand, higher consumption and thus higher sales.

Figure: Unit sales and store sales in Oregon and Washington

The table below gives the unit sales and store sales in the two states, Oregon and Washington; the Washington figures are far higher than those of Oregon.

State    Unit Sales    Store Sales
OR       60612         128598.5
WA       126287        267696.43
USA      186899        396294.93

5.4 Creating report with product categories of beer and wine and their product sub categories by order of unit sales and comment on key trend observed.

The graph below shows that Wine accounts for 74 % of total unit sales and Beer for 26 %, so beer consumption is roughly one-third of wine consumption. The table below also shows that among the Beer brands, Good has the highest unit sales and Top Measure the lowest. In the Wine category, Good is again the highest selling brand and Walrus the lowest. Demand for Wine is far higher than for Beer, and Good and Pearl are the highest selling brands of wine.

Figure: Unit sales of Beer and Wine

 

Brand          Unit Sales
Good           767
Pearl          725
Portsmouth     713
Top Measure    546
Walrus         608
Beer           3359
Good           2097
Pearl          2028
Portsmouth     1942
Top Measure    1883
Walrus         1760
Wine           9710
Beer & Wine    13069
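The 74 % and 26 % shares quoted for Wine and Beer follow directly from the category totals in the table above. The sketch below recomputes the totals and shares from the brand rows and orders the brands by unit sales; the variable names are illustrative assumptions.

```python
# Unit sales per brand, copied from the table above.
beer = {"Good": 767, "Pearl": 725, "Portsmouth": 713, "Top Measure": 546, "Walrus": 608}
wine = {"Good": 2097, "Pearl": 2028, "Portsmouth": 1942, "Top Measure": 1883, "Walrus": 1760}

beer_total = sum(beer.values())
wine_total = sum(wine.values())
combined_total = beer_total + wine_total

# Percentage share of each category, rounded to whole percent.
beer_share = round(100 * beer_total / combined_total)
wine_share = round(100 * wine_total / combined_total)

# Brands in each category ordered from highest to lowest unit sales.
beer_ranked = sorted(beer, key=beer.get, reverse=True)
wine_ranked = sorted(wine, key=wine.get, reverse=True)

print("Beer:", beer_total, f"({beer_share} %)", beer_ranked)
print("Wine:", wine_total, f"({wine_share} %)", wine_ranked)
```

Sorting the brands this way gives the "by order of unit sales" view that the report asks for.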

6.0    Conclusion

The Titanic had more than 2200 people on board; over 1500 died and only a little over 700 could be saved when the ship sank after striking an iceberg on the night of 14th April 1912. This study was carried out to understand and analyse the survival chances of the passengers on the Titanic. The data was analysed by determining the variables of significance, namely age, number of siblings or spouses on board, number of parents or children on board, body identification number and passenger fare, which depends on the class in which the passenger travelled.

The data is analysed with the Rattle data mining tool using principal component analysis. Correlation coefficients are determined to identify the relationships among the variables. A predictive model is also developed through decision tree analysis, in which the significant variables are the number of siblings or spouses, the port of embarkation, the passenger fare and the number of parents or children on board. This model helps in determining the chances of survival of passengers based on these variables.

References

  • Haag, Stephen; Cummings, Maeve; Phillips, Amy (2006). Management Information Systems for the information age. Toronto: McGraw-Hill Ryerson. p. 28. ISBN 0-07-095569-7. OCLC 63194770.
  • Ghanem, Moustafa; Guo, Yike; Rowe, Anthony; Wendel, Patrick (2002). "Grid-based knowledge discovery services for high throughput informatics". Proceedings 11th IEEE International Symposium on High Performance Distributed Computing. p. 416. doi:10.1109/HPDC.2002.1029946. ISBN 0-7695-1686-6.
  • Ghanem, Moustafa; Curcin, Vasa; Wendel, Patrick; Guo, Yike (2009). "Building and Using Analytical Workflows in Discovery Net". Data Mining Techniques in Grid Computing Environments. p. 119. doi:10.1002/9780470699904.ch8. ISBN 9780470699904.