
Lec 2: Image Classification

Challenges

Take cat classification as an example (the task is either to tell cats apart from other animals, or to tell different breeds of cats apart).

  1. Semantic Gap: The computer only sees a grid of RGB pixel values, and there is no obvious way to convert that grid into the semantically meaningful category label "cat".

  2. Viewpoint Variation: All pixels change (dramatically) when the camera moves!


  3. Intraclass Variation: Different breeds of cats have distinct RGB maps, but we must find the features they have in common.


  4. Fine-Grained Categories: Different breeds of cats are still cats, so they also share many features. We must extract more detailed features to distinguish them.


  5. Background Clutter: Sometimes the object we want to recognize blends into the background.


  6. Illumination Changes & Deformation & Occlusion (i.e. The Object is Blocked By Something):


    • For example, an animal mostly hidden under a cushion might actually be a raccoon. However, our common sense tells us that
      • cats are likely to appear in homes
      • cats can sometimes hide under cushions
      • raccoons are very unlikely to appear in homes

      so we still guess that it is a cat.

Naive Approaches

We can use the nearest neighbor approach.

That is,

  • we use the \(L^1\) norm of the pixel-wise difference to calculate the "distance" between the test image and every training image,
  • find the training image that has the smallest distance, and
  • give the prediction that the test image is of the same category as the nearest training image.
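For reference, the \(L^1\) distance between two images can be written as follows. The symbols \(I_1\), \(I_2\) and the index \(p\) (running over every pixel and color channel) are notation assumed here, not taken from the lecture.

\[
d_1(I_1, I_2) = \sum_{p} \left| I_1^{p} - I_2^{p} \right|
\]

For instance, for two tiny four-value "images" \(I_1 = (10, 20, 30, 40)\) and \(I_2 = (12, 17, 30, 45)\), the distance is \(2 + 3 + 0 + 5 = 10\).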

To make the prediction more robust, we can look at the \(k\) nearest neighbors instead and take a majority vote among their labels.
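The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's implementation: images are assumed to be flattened into equal-length lists of pixel values, and the names `l1_distance`, `knn_predict`, and the toy data are made up for this example.

```python
from collections import Counter

def l1_distance(a, b):
    """Sum of absolute differences between two equal-length pixel vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_images, train_labels, test_image, k=1):
    """Predict a label by majority vote among the k nearest training images."""
    # Distance from the test image to every training image.
    distances = [l1_distance(img, test_image) for img in train_images]
    # Indices of the k training images with the smallest distances.
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy "images" with four pixels each. The third one is deliberately
# mislabeled to imitate label noise in the training set.
train_images = [[90] * 4, [110] * 4, [100] * 4, [200] * 4, [220] * 4]
train_labels = ["dark", "dark", "bright", "bright", "bright"]
test_image = [101] * 4

prediction_k1 = knn_predict(train_images, train_labels, test_image, k=1)
prediction_k3 = knn_predict(train_images, train_labels, test_image, k=3)
```

With `k=1` the single noisy neighbor decides the answer (`"bright"`), while with `k=3` the two correctly labeled neighbors outvote it (`"dark"`). Note that there is no training step at all: every prediction scans the whole training set.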


Actually, the \(k\)-nearest neighbors algorithm is practical if you choose the right metric and the right data.

For example,

  • consider this arXiv paper recommendation system. It measures the similarity between papers with tf-idf (term frequency-inverse document frequency) features.

  • Also, running KNN on feature vectors instead of raw pixels can give good predictions.
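To make the tf-idf idea concrete, here is a rough sketch of nearest-neighbor retrieval over tf-idf vectors. How the recommendation system mentioned above actually computes its features is not stated in these notes, so everything here is an assumption for illustration: whitespace tokenization, the weighting `tf * log(N / df)`, cosine similarity as the comparison, and the helper names and toy abstracts.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(docs, query_index):
    """Index of the document most similar to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine_similarity(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return max(scores)[1]

# Hypothetical paper abstracts, shortened to one phrase each.
abstracts = [
    "convolutional networks for image classification",
    "image classification with deep convolutional networks",
    "bayesian inference for sparse regression",
]
nearest_to_first = most_similar(abstracts, 0)
```

Here the first abstract is matched with the second one, because they share several terms, while the only word it shares with the third is "for". The point is the same as for images: nearest neighbor works well once the features and the metric reflect what "similar" should mean.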

Set Hyperparameters

Always divide your dataset into three disjoint parts:

  • training set: the data you fit your model on
  • validation set: the data you evaluate your model on to tune your hyperparameters (e.g. the \(k\) in \(k\)-nearest neighbors and the distance metric)
    • NOTE: the only purpose of the validation set is to let you compare the performance of models based on different hyperparameters.
  • test set: you may evaluate your model on it only once. If the result is bad, you are out of luck; otherwise, congratulations!
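A three-way split like the one described above might look as follows. This is only a sketch: the 80/10/10 proportions, the fixed seed, and the function name `split_dataset` are assumptions chosen for the example, not values from the lecture.

```python
import random

def split_dataset(samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    shuffled = list(samples)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Stand-in for a list of (image, label) pairs.
samples = list(range(100))
train, val, test = split_dataset(samples)
```

Hyperparameters are then compared on `val` only, and `test` is kept aside until the final, one-time evaluation.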

Also, you can do cross-validation: split the data into folds, use each fold in turn as the validation set, and average the results.


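  • We do this because an average over several folds is a more reliable estimate than the score on a single split, which might happen to be unusually good or unusually bad.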
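Cross-validation for choosing \(k\) can be sketched like this. The example is a simplified assumption-laden illustration: the data are one-dimensional numbers rather than images, the distance is the absolute difference, folds are formed by striding through the list, and the names `knn_predict` and `cross_validate` are hypothetical.

```python
from collections import Counter

def knn_predict(train, query, k):
    """train is a list of (value, label) pairs; distance is |value - query|."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_validate(data, k, n_folds=5):
    """Average validation accuracy of k-NN over n_folds folds."""
    # Fold i takes every n_folds-th sample, starting at position i.
    folds = [data[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i, val_fold in enumerate(folds):
        # Train on all folds except the one currently held out.
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in val_fold)
        accuracies.append(correct / len(val_fold))
    return sum(accuracies) / n_folds

# Toy data: values below 50 are "low", the rest "high",
# with two labels flipped on purpose to imitate noise.
data = [(x, "low" if x < 50 else "high") for x in range(0, 100, 5)]
data[4] = (20, "high")
data[15] = (75, "low")

# Compare candidate values of k by their average validation accuracy.
scores = {k: cross_validate(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

Each candidate `k` gets one averaged score, so the comparison does not hinge on which samples happened to land in a single validation set. The cost is that every candidate is evaluated `n_folds` times, which is why this is mostly used on small datasets.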

Summary

  • In image classification we start with a training set of images and labels, and must predict labels on the test set.
  • Image classification is challenging due to the semantic gap: we need invariance to occlusion, deformation, lighting, intraclass variation, etc.
  • Image classification is a building block for other vision tasks
  • The K-Nearest Neighbors classifier predicts labels based on nearest training examples
  • Distance metric and K are hyperparameters
  • Choose hyperparameters using the validation set; only run on the test set once at the very end!