# CIS8008 Rattle Data Mining Assignment Help

## 1.0 Introduction

The sinking of the Titanic is one of the most famous events in maritime history. Many passengers were travelling aboard the ship; some survived while others lost their lives. This paper analyzes the survival rate of the Titanic's passengers, identifies the factors that mattered most in determining survival, and develops a predictive model that estimates a passenger's chance of survival from specific variables.

The paper uses passenger data covering age, sex, passenger class (pclass), number of siblings/spouses aboard, number of parents/children aboard, body identification number, passenger fare and so on. The Rattle data mining tool is used to analyze the data with techniques such as correlation analysis, decision trees and principal component analysis. These techniques help identify the variables associated with a higher survival rate. Predictive modeling is then carried out with a decision tree, so that the model can predict whether a particular passenger would have survived under given circumstances.

### Key components for survival rate

This section identifies the key variables that define the survival rate of the passengers. The table below provides summary statistics for the variables included in the analysis.

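| Parameter | Age | Sibsp | Parch | Body | Fare | Pclass |
|---|---|---|---|---|---|---|
| Min | 0.33 | 0.0 | 0.0 | 1 | 0.0 | 1 |
| 1st quartile | 21 | 0.0 | 0.0 | 80 | 7.89 | 2 |
| Median | 28 | 0.0 | 0.0 | 169 | 14.45 | 3 |
| Mean | 29.81 | 0.49 | 0.37 | 168 | 32.07 | 2.3 |
| 3rd quartile | 38 | 1.00 | 0.0 | 261 | 30.50 | 3 |
| Max | 80 | 8.0 | 9.0 | 328 | 512 | 3 |
| NA | 186 | 0.0 | 0.0 | 827 | 1 | 0.0 |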

Table 1: Showing summary statistics for the key variables

As shown in the table above, summary statistics were computed for six variables: age, sibsp, parch, body, fare and pclass. For age, 25% of the passengers were younger than 21 years and 25% were older than 38 years. There were three passenger classes on the ship: 1, 2 and 3. The rest of the variables are self-explanatory.

| Pclass | Survived = 0 | Survived = 1 | Grand Total |
|---|---|---|---|
| 1 | 123 | 200 | 323 |
| 2 | 158 | 119 | 277 |
| 3 | 528 | 181 | 709 |
| Grand Total | 809 | 500 | 1309 |

Table 2: Showing cross tab for survival rate of the passenger against the passenger class

As shown in the table above and the chart below, class 1 had the largest number of survivors (200), followed by class 3 (181) and class 2 (119). In terms of probability, however, class 1 had the highest survival rate (200 of 323, about 62%), followed by class 2 (119 of 277, about 43%) and class 3 (181 of 709, about 26%).
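The survival rate of each class can be recomputed directly from the counts in Table 2. The following is a minimal Python sketch; the dictionary layout and variable names are illustrative assumptions and are not part of the Rattle workflow.

```python
# Counts copied from Table 2: class -> (did not survive, survived)
counts = {1: (123, 200), 2: (158, 119), 3: (528, 181)}

# Survival rate = survivors / all passengers in that class
survival_rate = {
    pclass: survived / (died + survived)
    for pclass, (died, survived) in counts.items()
}

# Classes ordered from highest to lowest survival rate
ranking = sorted(survival_rate, key=survival_rate.get, reverse=True)

for pclass in ranking:
    print(f"Class {pclass}: {survival_rate[pclass]:.1%}")
```

Ranking by rate rather than by raw survivor count matters here because class 3 carried more than twice as many passengers as either of the other classes.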

Figure 1: Showing cross tab for survival rate of passengers against passenger class

Figure 2: Showing principal component analysis for variables for Titanic sinking

Figure 2 and Table 3 show that five principal components together account for the variance in the data. PC1 explains 28.3% of the variance, PC2 25.5%, PC3 20.7%, PC4 15.0% and PC5 only about 10%.

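| PC | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|
| SD | 1.190 | 1.130 | 1.019 | 0.867 | 0.717 |
| Variance | 0.283 | 0.255 | 0.207 | 0.150 | 0.102 |
| Cumm. Prop. | 0.283 | 0.538 | 0.746 | 0.897 | 1.000 |

| Factor | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|
| Age | -0.455 | 0.180 | -0.579 | 0.641 | -0.115 |
| Sibsp | -0.572 | 0.222 | 0.532 | 0.115 | 0.570 |
| Parch | -0.051 | 0.761 | 0.268 | -0.112 | -0.576 |
| Fare | -0.555 | -0.037 | -0.383 | -0.735 | -0.034 |
| Body | 0.392 | 0.579 | -0.401 | -0.143 | 0.571 |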

Table 3: Showing variance in data for the five principal components
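The proportion of variance attributed to each component follows from its standard deviation: the variance of a component is the square of its SD, and its share is that variance divided by the sum over all components. The sketch below, in Python with illustrative variable names, recomputes the Variance and Cumm. Prop. rows from the SD row of Table 3. Because the reported SDs are rounded, the recomputed shares agree with the table only to within rounding.

```python
# Standard deviations of PC1..PC5 copied from Table 3
sds = [1.190, 1.130, 1.019, 0.867, 0.717]

# Variance of each component is the square of its standard deviation
variances = [sd ** 2 for sd in sds]
total = sum(variances)

# Proportion of total variance explained by each component
proportions = [v / total for v in variances]

# Running total gives the cumulative proportion
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: proportion={p:.3f}, cumulative={c:.3f}")
```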

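The paper associates these five principal components with the age of the passengers, the number of siblings/spouses aboard, the number of parents/children aboard, the fare paid and the port of embarkation. Strictly, each principal component is a linear combination of the input variables rather than a single variable; the ranking below pairs each component with a variable of interest. Note that Table 3 reports loadings for body rather than for port of embarkation. The components are listed below in decreasing order of priority: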

1. Number of siblings/spouses aboard: This is associated with the most important principal component, which explains more than 28% of the variability in the data set. A higher number of siblings/spouses aboard is expected to lower the survival rate, as the passenger would also be responsible for them.
2. Number of parents/children aboard: This variable is associated with the second principal component, which explains 25% of the variance in the data set. It has an adverse impact on survival: the higher the number of parents/children aboard, the lower the survival rate of the passenger.
3. Age of the passenger: Age is associated with the third principal component, which explains 20% of the variability in the data set. The older the passenger, the lower the survival rate.
4. Passenger fare: Fare is associated with the fourth principal component, which explains about 15% of the variability in the data. The higher the fare paid, the higher the survival rate.
5. Port of embarkation: Port of embarkation is treated as the fifth principal component, which explains 10% of the variance in the data. There are three ports of embarkation (C, S and Q), and the survival rate differs by port. The data indicate that port C has the highest percentage survival rate of the three.
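One common way to attach a variable to a principal component is to pick the variable with the largest absolute loading on that component. The sketch below applies that rule to the loadings in Table 3. It is only one heuristic, written in Python with illustrative names, and it is not necessarily how the ranking above was produced.

```python
# Loadings copied from Table 3: variable -> [PC1, PC2, PC3, PC4, PC5]
loadings = {
    "Age":   [-0.455,  0.180, -0.579,  0.641, -0.115],
    "Sibsp": [-0.572,  0.222,  0.532,  0.115,  0.570],
    "Parch": [-0.051,  0.761,  0.268, -0.112, -0.576],
    "Fare":  [-0.555, -0.037, -0.383, -0.735, -0.034],
    "Body":  [ 0.392,  0.579, -0.401, -0.143,  0.571],
}

# For each component, find the variable with the largest absolute loading
dominant = {
    f"PC{j + 1}": max(loadings, key=lambda var: abs(loadings[var][j]))
    for j in range(5)
}

for pc, var in dominant.items():
    print(pc, "->", var)
```

Under this rule PC1 to PC4 map to Sibsp, Parch, Age and Fare. PC5 is less clear-cut, since Parch, Body and Sibsp all load on it with nearly equal magnitude (0.576, 0.571 and 0.570).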

Figure 3: Showing correlation among the principal components

| Factor | Age | Sibsp | Parch | Fare | Survived | Body |
|---|---|---|---|---|---|---|
| Age | 1.000 | -0.267 | -0.147 | 0.199 | -0.052 | 0.125 |
| Sibsp | -0.267 | 1.000 | 0.360 | 0.151 | -0.016 | -0.129 |
| Parch | -0.147 | 0.360 | 1.000 | 0.176 | 0.078 | 0.067 |
| Fare | 0.199 | 0.151 | 0.176 | 1.000 | 0.233 | -0.044 |
| Survived | -0.052 | -0.016 | 0.078 | 0.233 | 1.000 | NA |
| Body | 0.125 | -0.129 | 0.067 | -0.044 | NA | 1.000 |

Table 4: Showing correlation data among the principal components identified

Table 4 and Figure 3 show that parch and sibsp have the strongest positive correlation in the data (0.360), which suggests that such passengers were travelling with their whole family. Fare and survival are also positively correlated (0.233), meaning that the higher the fare paid, the higher the chance of survival. The strongest negative correlation is between age and sibsp (-0.267), meaning that passengers with more siblings aboard tended to be younger.
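To see which variable pairs stand out, the off-diagonal entries of Table 4 can be ranked by absolute value. The following Python sketch does this with the published coefficients; the NA entry for Survived and Body is stored as `None` and skipped. Names are illustrative.

```python
names = ["Age", "Sibsp", "Parch", "Fare", "Survived", "Body"]

# Correlation matrix copied from Table 4 (None marks the NA entries)
corr = [
    [1.000, -0.267, -0.147, 0.199, -0.052, 0.125],
    [-0.267, 1.000, 0.360, 0.151, -0.016, -0.129],
    [-0.147, 0.360, 1.000, 0.176, 0.078, 0.067],
    [0.199, 0.151, 0.176, 1.000, 0.233, -0.044],
    [-0.052, -0.016, 0.078, 0.233, 1.000, None],
    [0.125, -0.129, 0.067, -0.044, None, 1.000],
]

# Collect each unordered pair once (upper triangle), skipping NA values
pairs = [
    (names[i], names[j], corr[i][j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if corr[i][j] is not None
]

# Strongest relationships first, regardless of sign
pairs.sort(key=lambda item: abs(item[2]), reverse=True)

for a, b, r in pairs[:3]:
    print(f"{a} vs {b}: {r:+.3f}")
```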

Figure 4: Showing correlation cluster for the different variables

The correlation cluster suggests a high degree of correlation between body identification number and the survival of the passengers, although Table 4 reports the coefficient for this pair as NA. The next closest pair is parch and sibsp, whose correlation was also shown in the correlation table above.

### Development of predictive modeling for the survival rate determination

To decide which variables contribute most to determining passenger survival, a decision tree can be built, which identifies the variables of importance.

As shown in the figure above, the Rattle data mining tool can plot a conditional decision tree that shows the survival rate of the Titanic's passengers based on several variables of importance: the fare paid by the passenger, the number of parents/children aboard, the number of siblings aboard and the port of embarkation. All of these variables are of prime importance in deciding passenger survival, which is also supported by the principal component analysis. The decision tree has one root node, some intermediate nodes and, finally, leaf nodes that show the chances of survival. Each leaf node represents a set of passengers with similar characteristics. In total, 9 nodes are formed from conditions on the variables of importance (Kantardzic, 2003).

The root node splits on the fare paid by each passenger, dividing the passengers into two branches at a fare of 25.467. Node 2 contains the passengers who paid a fare of 25.467 or less, while node 5 contains those who paid more than 25.467. The sub-nodes under nodes 2 and 5 inherit the same fare condition.

The second level of splitting under node 2 is based on the port of embarkation, which has three levels: C, S and Q. Passengers who embarked at C or Q fall under node 3, while passengers who embarked at S fall under node 4. Under node 5, the split is based on whether sibsp is less than or equal to 2 or greater than 2. Passengers with sibsp greater than 2 fall under node 9, while passengers with sibsp less than or equal to 2 fall under node 6, which is split further on parch. Passengers with parch less than or equal to 1 fall under node 7, and passengers with parch greater than 1 fall under node 8. The actual survival rate is predicted by the five leaf nodes of the tree; the intermediate nodes only split the variable values and cannot be used to predict survival.

For each leaf node, the survival rate is shown on a 1 to 3 point scale, where a higher value indicates a higher probability of survival and vice versa. Among the five leaf nodes (3, 4, 7, 8 and 9), node 8 shows the highest survival rate, with a total of 15 passengers sharing its characteristics. The predictive model therefore indicates that passengers with fare > 25.467, sibsp <= 2 and parch > 1 had the highest survival rate among the passengers aboard.

After leaf node 8, node 7 shows the next highest survival rate; its passengers have fare > 25.467, sibsp <= 2 and parch <= 1. Node 7 represents a set of 65 passengers, who have a lower survival rate than node 8 but a much higher survival rate than the remaining passengers. Following the same trend, node 4, with fare <= 25.467 and embarkation at S, has a lower survival rate than nodes 8 and 7. Node 4 represents a set of 100 passengers and still shows a relatively high survival rate.

Nodes 9 and 3 represent sets of 8 and 40 passengers respectively, and the survival rate for these passengers is very low given their characteristics. Hence some decision is made at each node of the conditional decision tree, and the conditions for any particular node can be read off from the four important variables used in the tree.
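The splitting rules described above can be written as a small function that returns the leaf node for a passenger. This is a minimal Python sketch of the tree as described in the text, not output from Rattle; the function and parameter names are hypothetical, and any embarkation value other than C or Q is assumed to mean S.

```python
FARE_SPLIT = 25.467  # root-node threshold on fare, as stated in the text


def leaf_node(fare, embarked, sibsp, parch):
    """Return the leaf node (3, 4, 7, 8 or 9) for one passenger."""
    if fare <= FARE_SPLIT:
        # Node 2: split on port of embarkation
        if embarked in ("C", "Q"):
            return 3
        return 4  # assumed to be port S
    # Node 5: split on number of siblings/spouses aboard
    if sibsp > 2:
        return 9
    # Node 6: split on number of parents/children aboard
    if parch <= 1:
        return 7
    return 8


# Example: a passenger paying a fare of 80 with one sibling and two parents/children
print(leaf_node(fare=80.0, embarked="S", sibsp=1, parch=2))
```

Intermediate nodes 2, 5 and 6 appear only as comments because, as noted above, predictions are read from the leaf nodes.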

### Role of data quality in data warehouse architecture

Data quality is of immense importance to data warehouse architecture and determines the success of the overall data warehouse developed in an organization. It is therefore important to ensure that data quality meets the standards and requirements of the data warehouse architecture (Ye, 2003). Data quality matters in a data warehouse because it enhances the efficiency of the warehouse, avoids duplication of work and effort, saves cost and speeds up decisions for the managers who use data warehouse reports in their day-to-day decision making.

The key characteristics that a quality data set must have in order to support the data warehouse architecture of an organization are usefulness, validity, believability, accessibility and interpretability. These qualities are explained below:

• Usefulness: Usefulness can be ensured by checking the data against the latest version of the policies and procedures set for the data warehouse. Data that do not follow the latest quality norms or policies will not be useful for the data warehouse architecture established in the organization (Câmara and Raper, 1999).
• Validity: Validity can be ensured by reproducing the same set of data repeatedly, so that data quality is not affected and the data remain valid for the requirements of both the users and the data warehouse architecture.
• Believability: Believability defines the reliability of data, and reliability comes from a minimal error rate. A data set with the lowest level of error is believable.
• Accessibility: Users should be able to access the data whenever they need it. An efficient data warehouse system should offer 24-hour access to the data (Robert, 2006).
• Interpretability: Interpretability ensures that the data are in a form that makes sense to decision makers, so that decisions can be taken based on the data presented through the data warehouse architecture.

The role of data quality in a data warehouse shows in the benefits that quality data bring to the architecture: no duplication of work, faster decisions, better understanding, no missing or garbage values and greater efficiency for the organization. Quality data avoid duplication of work because data captured once can be used to generate several reports and interpretations (Ralf and Markus, 2011).

Furthermore, high-quality data make it easier for the decision maker to decide quickly, based on the better understanding obtained from continuous data that contain no missing or garbage values. Efficiency enhancement is the major role that quality data play for the organization, since implementing high-quality data in the data warehouse architecture saves time, effort and cost.

#### 3.1 Create a report and accompanying graph using a pivot table or Tableau 7.0 Desktop that lists the store values for each quarter of 1998 across country and state province and comment on key trends and patterns in this report

The graph below is plotted from the pivot table of store cost, store sales and unit sales for each country and quarter. Overall, quarter 3 shows the highest values for all three measures and quarter 4 the lowest. The countries differ, however: Mexico follows the overall trend with its highest figures in quarter 3, the USA peaks in quarter 1, and Canada records its highest store cost and store sales in quarter 2, while its unit sales are marginally higher in quarter 3 (12,966 against 12,885).

| Row Labels | Store Cost | Store Sales | Unit Sales |
|---|---|---|---|
| **1998** | 432565.7289 | 1079147.47 | 509987 |
| **Quarter 1** | 116512.6905 | 290873.18 | 137078 |
| Canada | 9576.6446 | 23881.13 | 11160 |
| Mexico | 47502.2264 | 118589.41 | 56133 |
| USA | 59433.8195 | 148402.64 | 69785 |
| **Quarter 2** | 115080.3318 | 287009.99 | 135745 |
| Canada | 11072.1808 | 27685 | 12885 |
| Mexico | 45683.9482 | 113830.59 | 54005 |
| USA | 58324.2028 | 145494.4 | 68855 |
| **Quarter 3** | 118322.14 | 295040.55 | 139412 |
| Canada | 10915.5866 | 27176.3 | 12966 |
| Mexico | 49267.9496 | 122706.05 | 57872 |
| USA | 58138.6038 | 145158.2 | 68574 |
| **Quarter 4** | 82650.5666 | 206223.75 | 97752 |
| Canada | 7768.1585 | 19303.03 | 9146 |
| Mexico | 30133.9206 | 75167.54 | 35904 |
| USA | 44748.4875 | 111753.18 | 52702 |
| **Grand Total** | 432565.7289 | 1079147.47 | 509987 |
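The peak and weakest quarter for each country can be read off the pivot table programmatically. The Python sketch below uses the store sales column only; the nested dictionary is an illustrative layout, not the format of the original data source.

```python
# Store sales by country and quarter, copied from the pivot table above
store_sales = {
    "Canada": {"Q1": 23881.13, "Q2": 27685.00, "Q3": 27176.30, "Q4": 19303.03},
    "Mexico": {"Q1": 118589.41, "Q2": 113830.59, "Q3": 122706.05, "Q4": 75167.54},
    "USA": {"Q1": 148402.64, "Q2": 145494.40, "Q3": 145158.20, "Q4": 111753.18},
}

# Quarter with the highest and lowest store sales for each country
peak = {country: max(q, key=q.get) for country, q in store_sales.items()}
low = {country: min(q, key=q.get) for country, q in store_sales.items()}

for country in store_sales:
    print(f"{country}: peak {peak[country]}, lowest {low[country]}")
```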

#### 3.2 Create a report and accompanying graph using a pivot table or Tableau 7.0 Desktop that lists for country and state province across product category and unit sales sub product category of breakfast foods and comment on key trends and patterns in this report

| Unit Sales | Canada | Mexico | USA | Grand Total |
|---|---|---|---|---|
| **Breakfast Foods** | 1453 | 6594 | 8502 | 16549 |
| Cereal | 556 | 2585 | 3499 | 6640 |
| Pancake Mix | 112 | 701 | 844 | 1657 |
| Pancakes | 156 | 603 | 865 | 1624 |
| Waffles | 629 | 2705 | 3294 | 6628 |
| **Grand Total** | 1453 | 6594 | 8502 | 16549 |

The table above shows that, of the three countries, the USA records the highest unit sales for every product: cereal, pancake mix, pancakes and waffles. Overall consumption of breakfast foods is therefore highest in the USA (8,502 units), followed by Mexico (6,594) and Canada (1,453). Across all countries, cereal is the best-selling sub-category (6,640 units), followed closely by waffles (6,628), then pancake mix (1,657) and pancakes (1,624). The consumption pattern is broadly similar across countries, with some differences: waffles outsell cereal in Canada and Mexico, whereas cereal leads in the USA.
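The best-selling breakfast food in each country can be checked from the unit sales in the table. This is a short Python sketch with an illustrative data layout.

```python
# Unit sales of breakfast foods by country, copied from the table above
unit_sales = {
    "Canada": {"Cereal": 556, "Pancake Mix": 112, "Pancakes": 156, "Waffles": 629},
    "Mexico": {"Cereal": 2585, "Pancake Mix": 701, "Pancakes": 603, "Waffles": 2705},
    "USA": {"Cereal": 3499, "Pancake Mix": 844, "Pancakes": 865, "Waffles": 3294},
}

# Best-selling product in each country
top_product = {c: max(p, key=p.get) for c, p in unit_sales.items()}

# Total units per product across the three countries
totals = {}
for products in unit_sales.values():
    for name, units in products.items():
        totals[name] = totals.get(name, 0) + units

print(top_product)
print(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```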

#### 3.3 Create a report and accompanying graph using a pivot table or Tableau 7.0 Desktop that lists for the states of Oregon (OR) and Washington (WA) the total sales and total sales value in dollar terms and comment on key trends and patterns in this report

| Row Labels | Unit Sales | Store Sales |
|---|---|---|
| **USA** | 186899 | 396294.93 |
| OR | 60612 | 128598.5 |
| WA | 126287 | 267696.43 |
| **Grand Total** | 186899 | 396294.93 |

The table above shows that unit sales are higher in Washington than in Oregon. Store sales follow the same pattern because the average price per unit is almost the same in both states (about 2.12 per unit), so store sales value is roughly proportional to unit sales. Unit sales in Washington are slightly more than double those in Oregon, and store sales value is likewise slightly more than double.
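The proportionality between unit sales and store sales can be checked with two ratios: the average store sales value per unit in each state, and the Washington-to-Oregon ratio for each measure. A minimal Python sketch, using illustrative names:

```python
# Figures copied from the Oregon/Washington table above
sales = {
    "OR": {"units": 60612, "store_sales": 128598.50},
    "WA": {"units": 126287, "store_sales": 267696.43},
}

# Average store sales value per unit sold in each state
avg_price = {
    state: row["store_sales"] / row["units"] for state, row in sales.items()
}

# How many times larger Washington is than Oregon on each measure
unit_ratio = sales["WA"]["units"] / sales["OR"]["units"]
value_ratio = sales["WA"]["store_sales"] / sales["OR"]["store_sales"]

print({state: round(price, 2) for state, price in avg_price.items()})
print(round(unit_ratio, 2), round(value_ratio, 2))
```

Near-identical average prices in the two states are what make the two ratios almost equal.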

#### 3.4 Create a report and accompanying graph using a pivot table or Tableau 7.0 Desktop that lists by product categories of beer and wine and their product sub categories, by order of unit sales and comment on key trends and patterns in this report

The table below gives the unit sales of beer and wine by brand. Unit sales are much higher for wine than for beer. Among the beer sub-categories, “Good” is the best-selling brand, followed by Pearl, Portsmouth, Walrus and Top Measure. A similar trend is observed for the wine sub-categories, where “Good” is again the best-selling brand, followed by Pearl, Portsmouth, Top Measure and Walrus.

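| Row Labels | Unit Sales |
|---|---|
| **Beer and Wine** | 13069 |
| **Beer** | 3359 |
| Good | 767 |
| Pearl | 725 |
| Portsmouth | 713 |
| Top Measure | 546 |
| Walrus | 608 |
| **Wine** | 9710 |
| Good | 2097 |
| Pearl | 2028 |
| Portsmouth | 1942 |
| Top Measure | 1883 |
| Walrus | 1760 |
| **Grand Total** | 13069 |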

#### Conclusion

This study examines the survival chances of passengers on the Titanic and identifies significant variables such as age, parch, sibsp and fare. Using principal component analysis and these significant variables, a predictive model is developed that helps in understanding the survival chances of the passengers.

The Rattle data mining tool is used for this purpose, and correlation coefficients between the variables are determined, which helps in comprehending and analyzing the survival rates of the passengers. The decision tree is built on the significant variables, and its different nodes correspond to different survival chances for the passengers on the Titanic.
