Anomaly detection

This algorithm helps us decide whether some examples in a data set are abnormal. The steps are as follows:

(1) Choose the features that you think may indicate anomalous examples.


Andrew Ng

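We can split the data set into three parts: a training set with 60% of the normal samples (it should contain no anomalous samples), a cross-validation set with 20% of the normal samples plus half of the known anomalous samples, and a test set with the remaining 20% of the normal samples plus the other half of the anomalous samples.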
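
The split described above can be sketched as follows. This is a minimal illustration rather than code from the course: the function name `split_for_anomaly_detection` and the label convention (1 = anomalous, 0 = normal) are assumptions made here.

```python
import random

def split_for_anomaly_detection(normal, anomalous, seed=0):
    """Return (train, cv, test).

    train is a list of normal samples only; cv and test are lists of
    (sample, label) pairs, where label 1 means anomalous (assumed convention).
    """
    rng = random.Random(seed)
    normal = list(normal)
    anomalous = list(anomalous)
    rng.shuffle(normal)
    rng.shuffle(anomalous)

    n_train = len(normal) * 6 // 10   # 60% of the normal samples
    n_cv = len(normal) * 2 // 10      # 20% of the normal samples
    half = len(anomalous) // 2        # half of the known anomalies

    train = normal[:n_train]
    cv = ([(x, 0) for x in normal[n_train:n_train + n_cv]]
          + [(x, 1) for x in anomalous[:half]])
    test = ([(x, 0) for x in normal[n_train + n_cv:]]
            + [(x, 1) for x in anomalous[half:]])
    return train, cv, test
```

With 100 normal and 10 anomalous samples, this gives 60 training samples, and 25 samples each (20 normal, 5 anomalous) in the cross-validation and test sets.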

The training set is used to compute P(x) above. The cross-validation set is used to verify the model. Because our data is skewed (only a small number of anomalous samples), the evaluation metric should be the F1 score rather than plain accuracy.
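
As a rough sketch of how the training and cross-validation sets are used, the code below assumes P(x) is a product of independent per-feature Gaussians. The slide defining P(x) is not reproduced in these notes, so this form is an assumption. It also assumes a sample is flagged as anomalous when P(x) falls below a threshold, here called `eps`. All function names are hypothetical.

```python
import math

def fit_gaussian(train):
    """Per-feature mean and variance from the normal-only training set.
    Assumes every feature has non-zero variance."""
    m, n = len(train), len(train[0])
    mu = [sum(x[k] for x in train) / m for k in range(n)]
    var = [sum((x[k] - mu[k]) ** 2 for x in train) / m for k in range(n)]
    return mu, var

def p(x, mu, var):
    """Assumed model: product of independent univariate Gaussian densities."""
    density = 1.0
    for xk, mk, vk in zip(x, mu, var):
        density *= math.exp(-((xk - mk) ** 2) / (2 * vk)) / math.sqrt(2 * math.pi * vk)
    return density

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall; label 1 = anomalous."""
    tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_threshold(cv, mu, var, candidates):
    """Pick the candidate threshold with the best F1 on the CV set."""
    y_true = [label for _, label in cv]
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        y_pred = [1 if p(x, mu, var) < eps else 0 for x, _ in cv]
        score = f1_score(y_true, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1
```

The point of using F1 here is that a model predicting "normal" for every sample would score high accuracy on skewed data but gets an F1 of 0.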

Anomaly detection VS supervised learning

Because supervised learning is quite similar to anomaly detection, it is worth knowing which one to use in a given situation.

Anomaly detection should be used in the following 2 situations.

(1) Future anomalous samples may have quite different features from the anomalous samples already in the data set.

(2) There are very few anomalous samples (roughly 10 to 50).

If, after choosing our features, we find that an anomalous point still lies close to the normal points, we should add one more feature that separates it from them.

Content-based recommender system using machine learning

There are several steps we need to follow:

(1) Define a feature vector x for each movie.

(2) Minimize the cost function.
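
The slide with the cost function is not reproduced in these notes. Assuming the usual regularized squared-error objective for content-based recommendations, with i indexing movies and j indexing users, it can be written as below. The symbols r(i,j) (1 if user j rated movie i), y^(i,j) (that rating), n_u (number of users), n (number of features) and λ (regularization strength) are assumptions introduced here.

```latex
\min_{\theta^{(1)},\dots,\theta^{(n_u)}} \;
\frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:\, r(i,j)=1}
\left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right)^{2}
+ \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta^{(j)}_{k} \right)^{2}
```

For a fixed user j, this is regularized linear regression on the feature vectors of the movies that user has rated.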

It's awesome that we can use linear regression to solve this problem.


Our goal is to minimize this cost function. When we minimize over x, theta is held constant; when we minimize over theta, x is held constant. Remember that i refers to the movie while j refers to the user.
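
The alternating scheme can be sketched as follows. This is an illustration under assumptions, not the course's implementation: the cost is taken to be a regularized squared error over rated (movie, user) pairs, each "minimize" phase is approximated by a single gradient step, and names such as `ratings`, `lam` and `lr` are hypothetical.

```python
def predict(theta_j, x_i):
    """Predicted rating: dot product of user parameters and movie features."""
    return sum(t * f for t, f in zip(theta_j, x_i))

def cost(X, Theta, ratings, lam):
    """ratings maps (i, j) -> rating of movie i by user j (rated pairs only)."""
    err = sum((predict(Theta[j], X[i]) - y) ** 2 for (i, j), y in ratings.items())
    reg = sum(v * v for row in X for v in row) + sum(v * v for row in Theta for v in row)
    return 0.5 * err + 0.5 * lam * reg

def step_theta(X, Theta, ratings, lam, lr):
    """One gradient step on Theta with X held constant."""
    grad = [[lam * v for v in row] for row in Theta]
    for (i, j), y in ratings.items():
        e = predict(Theta[j], X[i]) - y
        for k in range(len(X[i])):
            grad[j][k] += e * X[i][k]
    return [[v - lr * g for v, g in zip(row, grow)] for row, grow in zip(Theta, grad)]

def step_x(X, Theta, ratings, lam, lr):
    """One gradient step on X with Theta held constant."""
    grad = [[lam * v for v in row] for row in X]
    for (i, j), y in ratings.items():
        e = predict(Theta[j], X[i]) - y
        for k in range(len(X[i])):
            grad[i][k] += e * Theta[j][k]
    return [[v - lr * g for v, g in zip(row, grow)] for row, grow in zip(X, grad)]

def alternate(X, Theta, ratings, lam=0.1, lr=0.01, rounds=200):
    """Alternate between updating Theta (X fixed) and X (Theta fixed)."""
    for _ in range(rounds):
        Theta = step_theta(X, Theta, ratings, lam, lr)
        X = step_x(X, Theta, ratings, lam, lr)
    return X, Theta
```

Here `X[i]` is the feature vector of movie i and `Theta[j]` is the parameter vector of user j. Fully solving each subproblem before switching is another option; single gradient steps are used only to keep the sketch short.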