Visualizing variable distributions

Our first plot is a simple one: it shows the proportion of observations for each RegionName. As you can see in the plot shown below, the London, North West, and West Midlands regions account for around 59 percent of the observations in the data.

Vote Proportion by Region

To create the plot, we first build a table of frequencies for each region in RegionName with the table() function. We then pass that table to the prop.table() function, which computes the corresponding proportions, and these are used as the heights of the bars.

We use the barplot() function to produce the plot, and we can specify some options, such as the title (main), the y axis label (ylab), and the color for the bars (col). As always, you can find out more about the function's parameters with ?barplot:

table(data$RegionName) 
#> EE EM   L NE  NW SE SW  WM  Y
#> 94 20 210 32 134 79 23 133 78

prop.table(table(data$RegionName))
#>      EE      EM       L      NE      NW      SE      SW      WM       Y
#> 0.11706 0.02491 0.26152 0.03985 0.16687 0.09838 0.02864 0.16563 0.09714

barplot( 
    height = prop.table(table(data$RegionName)), 
    main = "Vote Proportion by Region", 
    ylab = "Frequency", 
    col = "white"
)
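The combined share of the largest regions can be checked from the counts printed above. The following is a minimal sketch, assuming we type those counts in by hand as a named vector (the names counts, props, and top_three are ours, not part of the chapter's code); prop.table() accepts a plain numeric vector as well as a table:

```r
# Region counts copied from the table() output above
counts <- c(EE = 94, EM = 20, L = 210, NE = 32, NW = 134,
            SE = 79, SW = 23, WM = 133, Y = 78)

# prop.table() divides each count by the total, so the result sums to 1
props <- prop.table(counts)

# Combined share of the London, North West, and West Midlands regions
top_three <- sum(props[c("L", "NW", "WM")])
print(round(top_three, 3))
```

Because the proportions always add up to 1, the bar heights can be compared directly, whatever the total number of observations is.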

Our next plot, shown below, is a little more eye-catching. Each point represents a ward observation and shows the Proportion of Leave votes for that ward. The points are arranged in vertical lines corresponding to RegionName and colored by the proportion of white population in each ward. As you can see, we have another interesting finding: it seems that the more diversified a ward's population is (seen in the darker points), the more likely the ward is to vote in favor of remaining in the EU (a lower Proportion value).

Proportion by RegionName and White Population Percentage

To create the plot, we need to load the ggplot2 and viridis packages. The first one is used to create the actual plot, while the second one is used to color the points with a scientifically interesting color palette called Viridis, which comes from color perception research done by Nathaniel Smith and Stéfan van der Walt (http://bids.github.io/colormap/). The details of the ggplot2 syntax will be explained in Chapter 4, Simulating Sales Data and Working with Databases. For now, all you need to know is that the ggplot() function receives as its first parameter the data frame with the data for the plot, and as its second parameter an aesthetics object created with the aes() function, which in turn can receive the variables to use for the x axis, the y axis, and the color.

After that, we add a points layer with the geom_point() function, and the Viridis color palette with the scale_color_viridis() function. Notice how we build the plot by adding objects together with the + operator. This is a very convenient feature that provides a lot of power and flexibility. Finally, we show the plot with the print() function. In R, some plotting functions show the plot immediately (for example, barplot()), while others return a plot object (for example, ggplot()) that needs to be printed explicitly:

library(ggplot2)
library(viridis)

plot <- ggplot(data, aes(x = RegionName, y = Proportion, color = White))
plot <- plot + geom_point() + scale_color_viridis()
print(plot)
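To make the layering idea concrete, here is a minimal sketch that builds the same kind of plot one step at a time. It assumes a small, made-up data frame called wards (not the chapter's data), and it swaps in scale_color_viridis_c(), which ships with ggplot2 itself (version 3.0.0 or later), so the separate viridis package is not needed for this variant:

```r
library(ggplot2)

# Hypothetical data, invented for illustration (not the chapter's data set)
wards <- data.frame(
    RegionName = c("L", "L", "NW", "NW", "WM", "WM"),
    Proportion = c(0.35, 0.42, 0.55, 0.61, 0.58, 0.66),
    White      = c(0.45, 0.60, 0.85, 0.92, 0.70, 0.88)
)

# Build the plot in steps; each `+` returns a new plot object
p <- ggplot(wards, aes(x = RegionName, y = Proportion, color = White))
p <- p + geom_point()
p <- p + scale_color_viridis_c()

# Nothing has been drawn so far; print() renders the plot
print(p)
```

Storing the intermediate object in p means more layers can be added later without repeating the earlier steps.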

The next set of plots, shown below, displays histograms for the NoQuals, L4Quals_plus, and AdultMeanAge variables. As you can see, the NoQuals variable appears to be normally distributed, but the L4Quals_plus and AdultMeanAge variables appear to be skewed towards the left and right, respectively. This tells us that most people in the sample don't have high education levels and are past 45 years of age.

Histogram for NoQuals, L4Quals_plus, and AdultMeanAge

Creating these plots is simple enough; you just need to pass the variable that will be used for the histogram into the hist() function, and optionally specify a title and an x axis label for each plot (we leave the label empty, as the information is already in the plot's title).

For the book, we arranged the plots so that they use space efficiently and are easy to compare, but when you create the plots using the code shown, you'll see them one by one. There are ways to group various plots together, but we'll look at them in Chapter 4, Simulating Sales Data and Working with Databases.

Let's have a look at the following code:

hist(data$NoQuals, main = "Histogram for NoQuals", xlab = "")
hist(data$L4Quals_plus, main = "Histogram for L4Quals_plus", xlab = "")
hist(data$AdultMeanAge, main = "Histogram for AdultMeanAge", xlab = "")
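Besides drawing the plot, hist() also returns an object that describes the bins, which can be handy when you want to inspect a distribution numerically. The sketch below assumes simulated values stored in a hypothetical vector called ages, standing in for a numeric column such as AdultMeanAge:

```r
set.seed(1)  # make the simulated values reproducible

# Hypothetical values, invented for illustration (not the chapter's data)
ages <- rnorm(200, mean = 45, sd = 5)

# plot = FALSE returns the histogram object without drawing it
h <- hist(ages, plot = FALSE)

# Bin edges and the number of observations that fall in each bin
print(h$breaks)
print(h$counts)

# A mean that sits far from the median hints at a skewed distribution
print(c(mean = mean(ages), median = median(ages)))
```

The counts always add up to the number of observations, and there is one more break than there are bins.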

Now that we understand a bit more about the distribution of the NoQuals, L4Quals_plus, and AdultMeanAge variables, we will look at their joint distribution in the scatter plots shown below. We can see how these scatter plots resemble the histograms by comparing the x axis and y axis in the scatter plots to the corresponding x axis in the histograms, and by comparing the frequency (height) in the histograms with the point density in the scatter plots.

Scatter plots for NoQuals, L4Quals_plus vs AdultMeanAge

We find a slight relation which shows that the older people are, the lower their levels of education. This can be interpreted in a number of ways, but we'll leave that as an exercise, to keep the focus on the programming, not the statistics. Creating these scatter plots is also very simple. Just send the x and y variables to the plot() function, and optionally specify labels for the axes:

plot(x = data$NoQuals, y = data$AdultMeanAge, ylab = "AdultMeanAge", xlab = "NoQuals")
plot(x = data$L4Quals_plus, y = data$AdultMeanAge, ylab = "AdultMeanAge", xlab = "L4Quals_plus")
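A scatter plot shows the direction of a relation, and the base cor() function can put a number on how strong it is. The following is a minimal sketch on simulated data; the vectors no_quals and mean_age are hypothetical, and the positive relation between them is built into the simulation rather than taken from the chapter's data:

```r
set.seed(42)  # make the simulated values reproducible

# Hypothetical data: age loosely increases with the share of adults without qualifications
no_quals <- runif(100, min = 10, max = 50)
mean_age <- 40 + 0.2 * no_quals + rnorm(100, sd = 4)

plot(x = no_quals, y = mean_age, xlab = "no_quals", ylab = "mean_age")

# Pearson correlation: near 0 means a weak linear relation, near 1 or -1 a strong one
r <- cor(no_quals, mean_age)
print(r)
```

With the real data, the same call would take the two columns being plotted, for example data$NoQuals and data$AdultMeanAge.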