A little something about machine learning

Thursday, 21st of January



During my second-to-last semester of uni, I ended up taking a unit called 'Social Computing'. The unit was not technical at all and was mostly a forum on how the social computing giants came to be so successful. The lecturer did, however, touch upon the role that machine learning has played in the age of social computing. He described how the immense amount of data and metadata currently available to us provides a foundation for deep analysis and modeling, and further allows relatively accurate predictions about new data.
With enough data, we can apply machine learning to anything from facial recognition and text analysis to identifying celestial bodies.

Below, I walk through my first interaction with machine learning using Python 3.4+ and scikit-learn.


First, we need to install a package manager for Python, such as pip, using apt-get.

console
$ sudo apt-get install python3-pip

Next, we install the packages required by scikit-learn: NumPy and SciPy.

console
$ python3 -m pip install -U numpy

$ python3 -m pip install -U scipy

Using pip to install SciPy threw me an error, so a workaround is to use apt-get instead.

console
$ sudo apt-get install python3-scipy

Now we can go ahead and install scikit-learn.

console
$ python3 -m pip install -U scikit-learn

To make sure everything has installed correctly, we open a console window, enter the Python 3 interpreter, import each package and see whether we get an error.

console
$ python3
Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>>
>>> import scipy
>>>
>>> import sklearn
>>>

We know each package has been installed successfully since no errors were thrown at us.
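If you want a slightly stronger sanity check than a clean import, you can also print each package's version from the interpreter (all three packages expose a __version__ attribute):

import numpy
import scipy
import sklearn
# Print the installed version of each package as a quick sanity check
print(numpy.__version__, scipy.__version__, sklearn.__version__)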

Now that all the boring stuff is out of the way, we can start playing with scikit-learn! Copy and paste the code below into your Python 3 interpreter to follow along.


from sklearn.datasets import load_iris
# Note: in scikit-learn 0.18+, train_test_split lives in sklearn.model_selection;
# sklearn.cross_validation still works here but is removed in later versions.
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNC
from sklearn import metrics


iris = load_iris()
X = iris.data
y = iris.target

The first thing we need is a dataset. We can import the famous Iris dataset through the load_iris function from the sklearn.datasets package.
Next, we need a way to split our dataset into training and testing groups, using the train_test_split function from the sklearn.cross_validation package. We will explain why this is necessary later.
Thirdly, we will import a function used to classify our data into groups. We will use the KNeighborsClassifier from sklearn.neighbors.

Before we continue, a few things need to be explained. Firstly, this is a supervised learning problem, which means that we will need to provide our training algorithm with labels corresponding to each group (in our case 0, 1 and 2 for setosa, versicolor and virginica).
Also, this is a classification problem. We know this because our problem requires us to classify our irises, based on the relationship between their characteristics (sepal and petal length and width in cm), into a group of discrete values 0, 1, 2 (our labels setosa, versicolor, virginica). A quick check of those labels is sketched just below.
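Since we already loaded y above, we can verify that the labels really are a small discrete set (a minimal check using NumPy, which we already have installed):

import numpy as np
# The response vector holds one discrete class label per observation
print(np.unique(y))    # [0 1 2]
# 50 observations of each class
print(np.bincount(y))  # [50 50 50]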

Finally, it would be useful to get a grasp of certain metrics, such as prediction accuracy. For that we will import metrics from sklearn.

Once we've imported all our requirements, we can go ahead and load in our Iris dataset with iris = load_iris(), grab our feature matrix with X = iris.data and, finally, our response vector with y = iris.target.

console
>>> type(iris)
<class 'sklearn.datasets.base.Bunch'>
>>> iris.keys()
dict_keys(['feature_names', 'target', 'data', 'DESCR', 'target_names'])
>>> print(iris.DESCR)
Iris Plants Database
Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
...
>>> list(iris.target_names)
['setosa', 'versicolor', 'virginica']
>>> iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> X.shape
(150, 4)
>>> y.shape
(150,)

iris is a dictionary-like object of type Bunch, which is scikit-learn's own data structure. It consists of several attributes, which we can see by calling iris.keys().

Also notice the shape of our feature matrix, X.shape, and of our response vector, y.shape. Our feature matrix must always be a two-dimensional object, in this case a 2D NumPy array. It consists of 150 observations (rows) and 4 features (columns): 'sepal length (cm)', 'sepal width (cm)', 'petal length (cm)' and 'petal width (cm)'. Our response vector is a 1-dimensional array where each value corresponds to one observation.

#     sepal length (cm)   sepal width (cm)   petal length (cm)   petal width (cm)   label - Iris species
0     5.1                 3.5                1.4                 0.2                0 - setosa
50    7.0                 3.2                4.7                 1.4                1 - versicolor
148   6.2                 3.4                5.4                 2.3                2 - virginica

The above table is a representation of our feature matrix and response vector.
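If you'd rather see those rows straight from the interpreter, you can index the same observations directly (plain NumPy indexing, nothing scikit-learn specific):

# Pick one sample of each species: rows 0, 50 and 148
print(X[[0, 50, 148]])
print(y[[0, 50, 148]])  # [0 1 2]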

OK, so now that we have our data, we can train a model and make predictions. Copy and paste the code below to follow along.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
knn = KNC(n_neighbors=5)
knn.fit(X_train, y_train)

Earlier we mentioned our train_test_split function. The point of this function is to divide our dataset into two separate groups: one group to train our model on, and the other to test it with. If we train and test on the same dataset, our predictions aren't really predictions at all but rather already known information, giving us a falsely optimistic prediction accuracy that won't be useful when trying to predict on new data. Instead, by dividing our data, we can use one subset to train our model on, and use the other subset to test our prediction accuracy. That way, the test subset is effectively 'new' data, unknown to our model.

We pass our feature matrix and response vector X, y to train_test_split() and set test_size=0.2 (our test subset will be 20% of our original dataset). In return, we get 4 objects: two feature matrices, X_train and X_test, one for training and one for testing, as well as two response vectors, y_train and y_test, again one for training and one for testing.
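A quick way to convince yourself the split worked is to check the shapes of the four returned objects; with test_size=0.2 on 150 samples we expect 120 for training and 30 for testing:

# Sanity-check the 80/20 split
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)  # (120,) (30,)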

Once we have our training and test sets, we can define our classifier. The classifier we're using is the K Neighbors Classifier. According to scikit-learn's documentation:

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these.
The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning).
- scikit-learn.org
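To make that principle concrete, here is a minimal sketch of the idea in plain NumPy. This is an illustration only, not how scikit-learn implements it internally:

import numpy as np

def knn_predict_one(X_train, y_train, new_point, k=5):
    # Euclidean distance from the new point to every training sample
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Predict the most common label among those k neighbours
    return np.bincount(y_train[nearest]).argmax()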

Ideally, when using the K neighbors classifier, we would run a few tests to see what number of neighbors gives us the best prediction accuracy score (a quick sketch of that follows below). But for this blog post, we'll predefine it to 5: KNC(n_neighbors=5). This function returns a K neighbors classifier object, knn. We then train our classifier by passing our feature matrix training subset and response vector training subset, X_train, y_train, to knn.fit().
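Here's what such a test could look like: a small loop over candidate values of k, reusing the split from above. This is just a sketch; for a more robust choice you'd average over several splits or use cross-validation:

# Try a range of k values and print the test accuracy of each
for k in range(1, 26):
    knn_k = KNC(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    print(k, metrics.accuracy_score(y_test, knn_k.predict(X_test)))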

console
>>> y_pred = knn.predict(X_test)
>>> print(metrics.accuracy_score(y_test, y_pred))
0.933333333333
>>> knn.predict([[3, 5, 3, 2]])
array([2])
>>> iris.target_names[knn.predict([[3, 5, 3, 2]])][0]
'virginica'

Now that we've trained our model, let's check our prediction accuracy and try making a prediction. In our Python interpreter, let's pass our X_test subset to knn.predict(). In return, we get a response vector of predicted labels. We then compare the known labels to our predicted labels by printing metrics.accuracy_score(y_test, y_pred). In this instance, we get an accuracy score of 93.33%. Now, if we pass a new feature matrix with a single observation to knn.predict(), we get back an array with a value that corresponds to one of our 3 label values (0, 1, 2). For a little verbosity, try iris.target_names[knn.predict([[3, 5, 3, 2]])][0]. This just uses the return value of the prediction as an index into iris.target_names, which gives us the name of our predicted Iris species.


Please note that since scikit-learn 0.19, passing a 1-dimensional array to knn.predict() raises an error; a single sample must be reshaped into a 2D array with reshape(1, -1), as sketched below.
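For example, on newer versions something like this should work, reusing the knn object from above (the sample values here are the same illustrative ones used earlier):

import numpy as np

sample = np.array([3, 5, 3, 2])            # a single sample as a 1-D array
# Reshape to a 2D array of shape (1, 4) before predicting
print(knn.predict(sample.reshape(1, -1)))  # [2]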
