The Statistics and Calculus with Python Workshop
Inferential Statistics

Unlike descriptive statistics, where our goal is to describe various characteristics of a dataset using specific quantities, with inferential statistics, we'd like to perform a particular statistical modeling process on our dataset so that we can infer further information, either about the dataset itself or even about unseen data points that are from the same population.

In this section, we will go through a number of different methods of inferential statistics. From these discussions, we will see that each method is designed for specific data and situations, and it is the responsibility of the statistician or machine learning engineer to appropriately apply them.

The first method that we will discuss is one of the most fundamental in classical statistics: t-tests.

T-Tests

In general, t-tests (also known as Student's t-tests) are used to compare two mean (average) statistics and conclude whether they are different enough from each other. The main application of a t-test is comparing the effect of an event (for example, an experimental drug, an exercise routine, and so on) on a population against a control group. If the means are different enough (we call this statistically significant), then we have good reason to believe in the effect of the given event.

There are three main types of t-tests in statistics: independent samples t-tests (used to compare the means of two independent samples), paired sample t-tests (used to compare the means of the same group at different times), and one-sample t-tests (used to compare the mean of one group with a predetermined mean).
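As a rough sketch of how these three variants map onto SciPy, the snippet below calls ttest_ind(), ttest_rel(), and ttest_1samp() from scipy.stats. The arrays, the seed, and the reference mean of 0 are made up purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed (an arbitrary choice) for reproducibility

# Hypothetical measurements: two unrelated groups, plus a "before/after" pair
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=0.5, scale=1.0, size=50)
before = rng.normal(loc=10.0, scale=2.0, size=50)
after = before + rng.normal(loc=0.3, scale=0.5, size=50)

# Independent samples t-test: two separate groups
p_independent = stats.ttest_ind(group_a, group_b).pvalue

# Paired sample t-test: the same subjects measured at two different times
p_paired = stats.ttest_rel(before, after).pvalue

# One-sample t-test: one group against a predetermined mean (0 here)
p_one_sample = stats.ttest_1samp(group_a, popmean=0.0).pvalue

print(p_independent, p_paired, p_one_sample)
```

Note that the paired test is equivalent to a one-sample test on the per-subject differences, which is why it needs both arrays to have the same length and ordering.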

The general workflow of a t-test is to first declare the null hypothesis that the two means are indeed equal and then consider the output of the t-test, which is the corresponding p-value. If the p-value is larger than a fixed threshold (usually, 0.05 is chosen), then we cannot reject the null hypothesis. If, on the other hand, the p-value is lower than the threshold, we can reject the null hypothesis, implying that the two means are different. We see that this is an inferential statistics method as, from it, we can infer a fact about our data; in this case, it is whether the two means we are interested in are different from each other.

We will not go into the theoretical details of these tests; instead, we will see how we can simply take advantage of the API offered in Python, or specifically the SciPy library. We used this library in the last chapter, so if you are not yet familiar with the tool, be sure to head back to Chapter 2, Python's Main Tools for Statistics to see how it can be installed in your environment.

Let's design our sample experiment. Say we have two arrays of numbers, each drawn from an unknown distribution, and we'd like to find out whether their respective means are equal to each other. Thus, we have our null hypothesis that the means of these two arrays are equal, which can be rejected if the p-value of our t-test is less than 0.05.

To generate the synthetic data for this example, we will use 20 samples from the standard form of the normal distribution (a mean of 0, and a standard deviation of 1) for the first array, and another 20 samples from a normal distribution with a mean of 0.2 and a standard deviation of 1 for the second array:

samples_a = np.random.normal(size=20)

samples_b = np.random.normal(0.2, 1, size=20)

To quickly visualize this dataset, we can use the plt.hist() function as follows:

plt.hist(samples_a, alpha=0.2)

plt.hist(samples_b, alpha=0.2)

plt.show()

This generates the following plot (note that your own output might be different):

Figure 3.22: Histogram of sample data for a t-test

Now, we will call the ttest_ind() function from the scipy.stats package. This function facilitates an independent samples t-test and will return an object having an attribute named pvalue; this attribute contains the p-value that will help us decide whether to reject our null hypothesis or not:

scipy.stats.ttest_ind(samples_a, samples_b).pvalue

The output is as follows:

0.8616483548091348

With this result, we do not reject our null hypothesis. Again, your p-value might be different from the preceding output, but chances are it is not lower than 0.05 either. Our final conclusion here is that we don't have enough evidence to say that the means of our two arrays are different (even though they were actually generated from two normal distributions with different means).

Let's repeat this experiment, but this time we have significantly more data—each array now contains 1,000 numbers:

samples_a = np.random.normal(size=1000)

samples_b = np.random.normal(0.2, 1, size=1000)

plt.hist(samples_a, alpha=0.2)

plt.hist(samples_b, alpha=0.2)

plt.show()

The histogram now looks like the following:

Figure 3.23: Histogram of sample data for a t-test

Running the t-test again, we see that this time, we obtain a different result:

scipy.stats.ttest_ind(samples_a, samples_b).pvalue

The output is as follows:

3.1445050317071093e-06

This p-value is a lot lower than 0.05, so we reject the null hypothesis: we now have enough evidence to say that the two arrays have different means.

These two experiments demonstrated a phenomenon we should keep in mind. In the first experiment, our p-value wasn't low enough for us to reject the null hypothesis, even though our data was indeed generated from two distributions with different means. In the second experiment, with more data, the t-test was more conclusive in terms of differentiating the two means.

In essence, with only 20 samples in each array, the first t-test did not have enough statistical power to detect a difference in means as small as 0.2, so it did not produce a low p-value even though the two means were indeed different. With 1,000 samples, the estimates of the two means were precise enough that the second t-test produced a much lower p-value. In general, many other statistical methods will similarly prove to be more conclusive as more data is used as input.
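One way to make this effect concrete is to repeat the experiment many times and count how often the t-test rejects the null hypothesis at each sample size. The following is a minimal sketch under assumed settings; the helper name rejection_rate, the number of trials, and the seed are all illustrative choices.

```python
import numpy as np
from scipy import stats

def rejection_rate(n, n_trials=500, mean_shift=0.2, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which the t-test rejects H0."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        a = rng.normal(0.0, 1.0, size=n)
        b = rng.normal(mean_shift, 1.0, size=n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / n_trials

rate_small = rejection_rate(20)    # 20 samples per array
rate_large = rejection_rate(1000)  # 1,000 samples per array
print(rate_small, rate_large)
```

With these settings, the rejection rate for 20 samples should be low, while the rate for 1,000 samples should be close to 1, mirroring the two experiments above.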

We have looked at an example of the independent samples t-test as a method of inferential statistics to test for the degree of difference between the averages of two given populations. Overall, the scipy.stats package offers a wide range of statistical tests that take care of all of the computation in the background and only return the final test output. This follows the general philosophy of the Python language, keeping the API at a high level so that users can take advantage of complex methodologies in a flexible and convenient manner.

Note

More details on what is available in the scipy.stats package can be found in its official documentation at https://docs.scipy.org/doc/scipy-0.15.1/reference/tutorial/stats.html.

Some of the most commonly used tests that can be called from the package include: t-tests or ANOVAs for differences in means; normality testing to ascertain whether samples have been drawn from a normal distribution; and computation of the Bayesian credible intervals for the mean and standard deviation of a sample population.
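As a brief illustration of these tests, the sketch below uses f_oneway() for a one-way ANOVA, shapiro() for a normality test, and bayes_mvs() for Bayesian credible intervals. The synthetic groups and the seed are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(0.0, 1.0, size=200)
g2 = rng.normal(0.0, 1.0, size=200)
g3 = rng.normal(1.0, 1.0, size=200)  # this group has a shifted mean

# One-way ANOVA: H0 is that all group means are equal
anova_p = stats.f_oneway(g1, g2, g3).pvalue

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
_, normal_p = stats.shapiro(g1)
_, skewed_p = stats.shapiro(rng.exponential(size=200))

# Bayesian credible intervals for the mean, variance, and standard deviation
mean_est, var_est, std_est = stats.bayes_mvs(g1, alpha=0.95)

print(anova_p, normal_p, skewed_p)
print(mean_est.statistic, mean_est.minmax)
```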

Moving away from the scipy.stats package, we have seen that the pandas library also supports a wide range of statistical functionalities, especially with its convenient describe() method. In the next section, we will look into the second inferential statistics method: the correlation matrix of a dataset.

Correlation Matrix

A correlation matrix is a two-dimensional table containing correlation coefficients between each pair of attributes of a given dataset. A correlation coefficient between two attributes quantifies their level of linear correlation, or in other words, how similarly they behave in a linear fashion. A correlation coefficient lies in the range between -1 and +1, where +1 denotes perfect positive linear correlation, 0 denotes no linear correlation, and -1 denotes perfect negative linear correlation.
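To see where such a coefficient comes from, the following sketch computes the Pearson correlation coefficient by hand as the covariance of two arrays divided by the product of their standard deviations. The helper name pearson_r and the sample values are illustrative only; in practice, np.corrcoef() or the pandas corr() method does the same job.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = a_centered @ b_centered
    denominator = np.sqrt((a_centered @ a_centered) * (b_centered @ b_centered))
    return numerator / denominator

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_up = pearson_r(values, 3 * values + 1)     # points on an increasing line
r_down = pearson_r(values, -2 * values + 7)  # points on a decreasing line
print(r_up, r_down)
```

Up to floating-point rounding, the two printed values sit at the two ends of the range, +1 and -1.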

If two attributes have a high linear correlation, then when one increases, the other tends to increase by the same amount multiplied by a constant. In other words, if we were to plot the data in the two attributes on a scatter plot, the individual points would tend to follow a line with a positive slope. For two attributes having no correlation, the best-fit line tends to be horizontal, and two attributes having a negative correlation are represented by a line with a negative slope.

The correlation between two attributes can, in a way, tell us how much information is shared among the attributes. We can infer from two correlated attributes, either positively or negatively, that there is some underlying relationship between them. This is the idea behind the correlation matrix as an inferential statistics tool.

In some machine learning models, it is recommended that if we have highly correlated features, we should only leave one in the dataset before feeding it to the models. In most cases, having another attribute that is highly correlated to one that a model has been trained on does not improve its performance; what's more, in some situations, correlated features can even mislead our models and steer their predictions in the wrong direction.
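A minimal sketch of this kind of filtering is shown below. The helper name drop_highly_correlated, the 0.9 threshold, and the demo columns are assumptions for illustration rather than a standard API; which column of a correlated pair to keep is a judgment call that this sketch settles by column order.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.9):
    """Return a copy of `frame` without columns that strongly correlate with an earlier column."""
    corr = frame.corr().abs()
    # Keep only the part above the diagonal so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)

rng = np.random.default_rng(0)
a = rng.random(500)
demo = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(0, 0.05, 500),  # nearly a rescaled copy of 'a'
    'c': rng.random(500),                   # generated independently of 'a'
})

reduced = drop_highly_correlated(demo)
print(reduced.columns.tolist())
```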

This is to say that the correlation coefficient between two data attributes, and thus the correlation matrix of the dataset, is an important statistical object for us to consider. Let's see this in a quick example.

Say we have a dataset of three attributes, 'x', 'y', and 'z'. The data in 'x' and 'z' is randomly generated in an independent way, so there should be no correlation between them. On the other hand, we will generate 'y' as the data in 'x' multiplied by 2 and add in some random noise. This can be done with the following code, which creates a dataset with 500 entries:

x = np.random.rand(500,)

y = x * 2 + np.random.normal(0, 0.3, 500)

z = np.random.rand(500,)

df = pd.DataFrame({'x': x, 'y': y, 'z': z})

From here, the correlation matrix (which, again, contains correlation coefficients of every pair of attributes in our dataset) can be easily computed with the corr() method:

df.corr()

The output is as follows:

          x         y         z
x  1.000000  0.889995  0.019747
y  0.889995  1.000000  0.045332
z  0.019747  0.045332  1.000000

We see that this is a 3 x 3 matrix, as there are three attributes in the calling DataFrame object. Each number denotes the correlation between the row and the column attributes. One effect of this representation is that we have all of the diagonal values in the matrix as 1, as each attribute is perfectly correlated to itself.

What's more interesting to us is the correlation between different attributes: as 'z' was generated independently of 'x' (and therefore 'y'), the values in the 'z' row and column are relatively close to 0. In contrast to this, the correlation between 'x' and 'y' is quite close to 1, as one was generated to be roughly two times the other.

Additionally, it is common to visually represent the correlation matrix with a heatmap. This is because when we have a large number of attributes in our dataset, a heatmap will help us identify the regions that correspond to highly correlated attributes more efficiently. The visualization of a heatmap can be done using the sns.heatmap() function from the seaborn library:

sns.heatmap(df.corr(), center=0, annot=True)

bottom, top = plt.ylim()

plt.ylim(bottom + 0.5, top - 0.5)

plt.show()

The annot=True argument specifies that the values in the matrix should be printed out in each cell of the heatmap.

The code will produce the following:

Figure 3.24: Heatmap representing a correlation matrix

In this case, while visually inspecting a correlation matrix heatmap, we can focus on the bright regions, aside from the diagonal cells, to identify highly correlated attributes. If there were negatively correlated attributes in a dataset (which we don't have in our current example), those could be detected with dark regions as well.

Overall, the correlation matrix of a given dataset can be a useful tool for us to understand the relationship between the different attributes of that dataset. We will see an example of this in the upcoming exercise.

Exercise 3.04: Identifying and Testing Equality of Means

In this exercise, we will practice the two inferential statistics methods to analyze a synthetic dataset that we have generated for you. The dataset can be downloaded from the GitHub repository at https://packt.live/3ghKkDS.

Here, our goal is to first identify which attributes in this dataset are correlated with each other and then apply a t-test to determine whether any pair of attributes have the same mean.

With that said, let's get started:

  1. In a new Jupyter notebook, import pandas, matplotlib, seaborn, and the ttest_ind() method from the stats module from SciPy:

    import pandas as pd

    from scipy.stats import ttest_ind

    import matplotlib.pyplot as plt

    import seaborn as sns

  2. Read in the dataset that you have downloaded. The first five rows should look like the following:

    Figure 3.25: Reading the first five rows of the dataset

  3. In the next code cell, use seaborn to generate the heatmap that represents the correlation matrix for this dataset. From the visualization, identify the pair of attributes that are correlated with each other the most:

    sns.heatmap(df.corr(), center=0, annot=True)

    bottom, top = plt.ylim()

    plt.ylim(bottom + 0.5, top - 0.5)

    plt.show()

    This code should produce the following visualization:

    Figure 3.26: Correlation matrix for the dataset

    From this output, we see that attributes 'x' and 'y' have a correlation coefficient that is quite high: 0.94.

  4. Using the jointplot() function in seaborn, create a combined plot with two elements: a scatter plot on a two-dimensional plane where the coordinates of the points correspond to the individual values in 'x' and 'y' respectively, and two histograms representing the distributions of those values. Observe the output and decide whether the two distributions have the same mean:

    sns.jointplot(x='x', y='y', data=df)

    plt.show()

    This will produce the following output:

    Figure 3.27: Combined plot of correlated attributes

    From this visualization, it is not clear whether the two attributes have the same mean or not.

  5. Instead of using a visualization, run a t-test with a 0.05 level of significance to decide whether the two attributes have the same mean:

    ttest_ind(df['x'], df['y']).pvalue

    This command will have the following output:

    0.011436482008949079

    This p-value is indeed lower than 0.05, allowing us to reject the null hypothesis that the two distributions have the same mean, even though they are highly correlated.

In this exercise, we applied the two inferential statistics methods that we have learned in this section to analyze a pair of correlated attributes in a dataset.
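The exercise's takeaway, that high correlation and equal means are separate questions, can be reproduced on made-up data. In the sketch below, 'y' is simply 'x' shifted upward by 0.5 with a little noise added; the shift, noise level, and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)
y = x + 0.5 + rng.normal(0.0, 0.2, size=500)  # follows x closely, but shifted upward

corr_xy = np.corrcoef(x, y)[0, 1]       # linear correlation between x and y
p_value = stats.ttest_ind(x, y).pvalue  # test of equal means

print(corr_xy, p_value)
```

Here the correlation is high because the two arrays move together, yet the t-test still rejects the hypothesis of equal means because of the constant shift.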

Note

To access the source code for this specific section, please refer to https://packt.live/31Au1hc.

You can also run this example online at https://packt.live/2YTt7L7.

In the next and final section on the topic of inferential statistics, we will discuss the process of using statistical and machine learning models as a method of making inferences using statistics.

Statistical and Machine Learning Models

Modeling a given dataset using a mathematical or machine learning model, which in itself is capable of generalizing any potential patterns and trends in the dataset to unseen data points, is another form of inferential statistics. Machine learning itself is arguably one of the fastest-growing fields in computer science. However, most machine learning models actually leverage mathematical and statistical theories, which is why the two fields are heavily connected. In this section, we will consider the process of training a model on a given dataset and how Python can help facilitate that process.

It is important to note that a machine learning model does not actually learn in the same sense that humans do. Most of the time, a model attempts to solve an optimization problem that minimizes its training error, which represents how well it can process the pattern within the training data, with the hope that the model can generalize well on unseen data that is drawn from the same distributions as the training data.

For example, a linear regression model generates a line of best fit through the data points in a given dataset; this line generally does not pass through every point. In the model definition, this line is the one that minimizes the sum of squared vertical distances (residuals) between the line and the individual data points, and by solving this optimization problem, a linear regression model is able to output that best-fitted line.
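To illustrate the optimization view, the sketch below solves an ordinary least squares problem, which minimizes the sum of squared vertical distances, directly with np.linalg.lstsq() and compares the result with scikit-learn's LinearRegression. The synthetic data and the seed are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.random(200)
y = 2 * x + rng.normal(0, 0.3, 200)

# Design matrix with a column of ones so the line may have an intercept
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# scikit-learn solves the same least-squares problem
model = LinearRegression().fit(x.reshape(-1, 1), y)

def sum_squared_residuals(m, b):
    """Sum of squared vertical distances between the line and the data."""
    return np.sum((y - (m * x + b)) ** 2)

print(slope, intercept)
print(model.coef_[0], model.intercept_)
```

Any other line, for example one with a slightly different slope, yields a larger value of sum_squared_residuals() on the same data.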

Overall, each machine learning algorithm models the data and therefore the optimization problem in a different way, each suitable for specific settings. However, different levels of abstraction built into the Python language allow us to skip through these details and apply different machine learning models at a high level. All we need to keep in mind is that statistical and machine learning models are another method of inferential statistics where we are able to make predictions on unseen data, given the pattern represented in a training dataset.

Let's say we are given the task of training a model on the sample dataset we have in the previous section, where the learning features are 'x' and 'z', and our prediction target is 'y'. That is, our model should learn any potential relationship between 'x' or 'z' and 'y', and from there know how to predict unseen values of 'y' from the data in 'x' and 'z'.

Since 'y' is a numerical attribute, we will need a regression model, as opposed to a classifier, to train on our data. Here, we will use one of the most commonly used regressors in statistics and machine learning: linear regression. For that, we will require the scikit-learn library, one of the most—if not the most—popular predictive data analysis tools in Python.

To install scikit-learn, run the following pip command:

$ pip install scikit-learn

You can also use the conda command to install it:

$ conda install scikit-learn

Now, we import the linear regression model and fit it to our training data:

from sklearn import linear_model

model = linear_model.LinearRegression()

model.fit(df[['x', 'z']], df['y'])

In general, the fit() method, called by a machine learning model object, takes in two arguments: the independent features (that is, the features that will be used to make predictions), which in this case are 'x' and 'z', and the dependent feature or the prediction target (that is, the attribute that we'd like to make predictions on), which in this case is 'y'.

This fit() method will initiate the training process of the model on the given data. Depending on the complexity of the model as well as the size of the training data, this process might take a significant amount of time. For a linear regression, however, the training process should be relatively fast.

Once our model has finished training, we can look at its various statistics. What statistics are available depends on the specific model being used; for a linear regression, it is common for us to consider the coefficients. A regression coefficient is an estimate of the linear relationship between an independent feature and the prediction target. In essence, the regression coefficients are what the linear regression model estimates for the slope of the best-fit line for a specific predictor variable, 'x' or 'z' in our case, and the feature we'd like to predict—'y'.

These statistics can be accessed as follows:

model.coef_

This will give us the following output:

array([1.98861194, 0.05436268])

Again, the output from your own experiment might not be exactly the same as the preceding. However, there is a clear trend to these coefficients: the first coefficient (denoting the estimated linear relationship between 'x' and 'y') is approximately 2, while the second (denoting the estimated linear relationship between 'z' and 'y') is close to 0.

This result is quite consistent with what we did to generate this dataset: 'y' was generated to be roughly equal to the elements in 'x' multiplied by 2, while 'z' was independently generated. By looking at these regression coefficients, we can obtain information about which features are the best (linear) predictors for our prediction target. Some consider these types of statistics to be explainability/interpretability statistics, as they give us insights regarding how the prediction process was done.

What's more interesting to us is the process of making predictions on unseen data. This can be done by calling the predict() method on the model object like so:

model.predict([[1, 2], [2, 3]])

The output will be as follows:

array([2.10790143, 4.15087605])

Here, we pass to the predict() method any data structure that can represent a two-dimensional table (in the preceding code, we used a nested list, but in theory, you could also use a two-dimensional NumPy array or a pandas DataFrame object). This table needs to have its number of columns equal to the number of independent features in the training data; in this case, we have two ('x' and 'z'), so each sub-list in [[1, 2], [2, 3]] has two elements.

From the predictions produced by the model, we see that when 'x' is equal to 1 and 'z' is equal to 2 (our first test case), the corresponding prediction is roughly 2. This is consistent with the fact that the coefficient for 'x' is approximately 2 and the one for 'z' is close to 0. The same goes for the second test case.
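To make the link between the coefficients and the predictions explicit, the following sketch rebuilds a dataset in the same way as before (with an arbitrary seed, so the numbers will differ from the preceding output) and reproduces predict() by hand using the fitted intercept_ and coef_ attributes.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

rng = np.random.default_rng(0)
x = rng.random(500)
df = pd.DataFrame({
    'x': x,
    'y': 2 * x + rng.normal(0, 0.3, 500),
    'z': rng.random(500),
})

model = linear_model.LinearRegression()
model.fit(df[['x', 'z']], df['y'])

# Two unseen points, using the same column names as the training features
new_points = pd.DataFrame({'x': [1, 2], 'z': [2, 3]})

# A linear regression prediction is the intercept plus a weighted sum of the features
manual = model.intercept_ + new_points.to_numpy() @ model.coef_
predicted = model.predict(new_points)

print(manual)
print(predicted)
```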

And that is an example of how a machine learning model can be used to make predictions on data. Overall, the scikit-learn library offers a wide range of models for different types of problems: classification, regression, clustering, dimensionality reduction, and so on. The API among the models is consistent with the fit() and predict() methods, as we have seen. This allows a greater degree of flexibility and streamlining.

An important concept in machine learning is model selection. Not all models are created equal; some models, due to their design or characteristics, are better suited to a given dataset than others. This is why model selection is an important phase in the whole machine learning pipeline. After collecting and preparing a training dataset, machine learning engineers typically feed the dataset to a number of different models, and some models might be excluded from the process due to poor performance.

We will see a demonstration of this in the following exercise, where we are introduced to the process of model selection.

Exercise 3.05: Model Selection

In this exercise, we will go through a sample model selection procedure, where we attempt to fit three different models to a synthetic dataset and consider their performance:

  1. In the first code cell of a new Jupyter notebook, import the following tools:

    import numpy as np

    from sklearn.datasets import make_blobs

    from sklearn.model_selection import train_test_split

    from sklearn.metrics import accuracy_score

    from sklearn.neighbors import KNeighborsClassifier

    from sklearn.svm import SVC

    from sklearn.ensemble import GradientBoostingClassifier

    import matplotlib.pyplot as plt

    Note

    We are not yet familiar with some of the tools, but they will be explained to us as we go through this exercise.

    Now, we'd like to create a synthetic dataset of points lying on a two-dimensional plane. Each of these points belongs to a specific group, and points belonging to the same group should revolve around a common center point.

  2. This synthetic data can be generated using the make_blobs function that we have imported from the sklearn.datasets package:

    n_samples = 10000

    centers = [(-2, 2), (0, 0), (2, 2)]

    X, y = make_blobs(n_samples=n_samples, centers=centers, \

                      shuffle=False, random_state=0)

    As we can see, this function takes in an argument named n_samples, which specifies the number of data points that should be produced. The centers argument, on the other hand, specifies the total number of groups that the individual points belong to and their respective coordinates. In this case, we have three groups of points centering around (-2, 2), (0, 0), and (2, 2) respectively.

  3. Lastly, by specifying the random_state argument as 0, we ensure that the same data is generated every time we rerun this notebook. As we mentioned in Chapter 1, Fundamentals of Python, this is good practice in terms of reproducibility.

    Our goal here is to train various models on this data so that when fed a new list of points, the model can decide which group each point should belong to with high accuracy.

    The make_blobs() function returns a tuple of two objects that we are assigning to the variables X and y, respectively. The first element in the tuple contains the independent features of the dataset; in this case, they are the x and y coordinates of the points. The second tuple element is our prediction target, the index of the group each point belongs to. The convention is to store the independent features in a matrix named X, and the prediction targets in a vector named y, as we are doing.

  4. Print out these variables to see what we are dealing with. Type X as the input:

    X

    This will give the following output:

    array([[-0.23594765, 2.40015721],

           [-1.02126202, 4.2408932 ],

           [-0.13244201, 1.02272212],

           ...,

           [ 0.98700332, 2.27166174],

           [ 1.89100272, 1.94274075],

           [ 0.94106874, 1.67347156]])

    Now, type y as the input:

    y

    This will give the following output:

    array([0, 0, 0, ..., 2, 2, 2])

  5. Now, in a new code cell, we'd like to visualize this dataset using a scatter plot:

    plt.scatter(X[:, 0], X[:, 1], c=y)

    plt.show()

    We use the first attribute in our dataset as the x coordinates and the second as the y coordinates for the points in the scatter plot. We can also quickly specify that points belonging to the same group should have the same color by passing our prediction target y to argument c.

    This code cell will produce the following scatter plot:

    Figure 3.28: Scatter plot for a machine learning problem

    The most common strategy of a model selection process is to first split our data into a training dataset and a test/validation dataset. The training dataset is used to train the machine learning models we'd like to use, and the test dataset is used to validate the performance of those models.

  6. The train_test_split() function from the sklearn.model_selection package facilitates the process of splitting our dataset into the training and test datasets. In the next code cell, enter the following code:

    X_train, X_test, \

    y_train, y_test = train_test_split(X, y, shuffle=True, \

                                       random_state=0)

    As we can see, this function returns a tuple of four objects, which we are assigning to the four preceding variables: X_train contains the data in the independent features for the training dataset, while X_test contains the data of the same features for the test dataset, and the equivalent goes for y_train and y_test.

  7. We can inspect how the split was done by considering the shape of our training dataset:

    X_train.shape

    (7500, 2)

    By default, 75 percent of the input data is randomly selected as the training dataset, and the remaining 25 percent becomes the test dataset. This is demonstrated by the preceding output, where we have 7,500 entries in our training dataset out of the original 10,000 entries.

  8. In the next code cell, we will initialize the machine learning models that we have imported without specifying any hyperparameters (more on this later):

    models = [KNeighborsClassifier(), SVC(),\

              GradientBoostingClassifier()]

  9. Next, we will loop through each of them, train them on our training dataset, and finally compute their accuracy on the test dataset using the accuracy_score function, which compares the values stored in y_test and the predictions generated by our models in y_pred:

    for model in models:

        model.fit(X_train, y_train)

        y_pred = model.predict(X_test)

        print(f'{type(model).__name__}: {accuracy_score(y_test, y_pred)}')

    Again, the fit() method is used to train each model on X_train and y_train, while predict() is used to have the models make predictions on X_test. This will produce an output similar to the following:

    KNeighborsClassifier: 0.8792

    SVC: 0.8956

    GradientBoostingClassifier: 0.8876

From here, we see that the SVC model performed the best on this test set, although the differences among the three models are small. In an actual model selection process, you might incorporate more tasks, such as cross-validation, to ensure that the model you select in the end is the best option.
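As a hedged sketch of what cross-validation could look like here, the snippet below uses cross_val_score() from sklearn.model_selection on a smaller dataset generated in the same way as in the exercise. The reduced sample size, the choice of five folds, and the two models compared are assumptions made to keep the example short.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# A smaller version of the exercise's dataset to keep the run short
centers = [(-2, 2), (0, 0), (2, 2)]
X, y = make_blobs(n_samples=1500, centers=centers, random_state=0)

cv_results = {}
for model in [KNeighborsClassifier(), SVC()]:
    # cv=5 splits the data into five folds; each fold serves as the test set once
    scores = cross_val_score(model, X, y, cv=5)
    cv_results[type(model).__name__] = scores
    print(f'{type(model).__name__}: {scores.mean():.4f} +/- {scores.std():.4f}')
```

Comparing the mean score together with its spread across folds gives a more stable basis for choosing a model than a single train/test split.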

And that is the end of our model selection exercise. Through the exercise, we have familiarized ourselves with the general procedure of working with a scikit-learn model. As we have seen, the fit/predict API is consistent across all models implemented in the library, which leads to a high level of flexibility and convenience for Python programmers.

This exercise also concludes the general topic of inferential statistics.

Note

To access the source code for this specific section, please refer to https://packt.live/2BowiBI.

You can also run this example online at https://packt.live/3dQdZ5h.

In the next and final section of this chapter, we will go through a number of other libraries that can support various specific statistical procedures.