High quality open source CNN software packages have been made available.
There are also well-written CNN tutorials and CNN software manuals. However, I believe that introductory CNN material specifically prepared for be-

object detection in images, etc. We will focus on image classification (or categorization) in this note. In image categorization, every image has a major object that occupies a large portion of the image. An image is classified into one of the classes based on the identity of its main object, e.g., dog, airplane, bird, etc.
2 Preliminaries
We start with a discussion of some background knowledge that is necessary to understand how a CNN runs. Readers who are familiar with these basics can skip this section.
2.1 Tensor and vectorization
Everybody is familiar with vectors and matrices. We use a symbol shown in boldface to represent a vector, e.g., x ∈ R^D is a column vector with D elements. We use a capital letter to denote a matrix, e.g., X ∈ R^{H×W} is a matrix with H rows and W columns. The vector x can also be viewed as a matrix with 1 column and D rows.

These concepts can be generalized to higher-order matrices, i.e., tensors. For example, x ∈ R^{H×W×D} is an order 3 (or third order) tensor. It contains HWD elements, each of which can be indexed by an index triplet (i, j, d), with 0 ≤ i < H, 0 ≤ j < W, and 0 ≤ d < D. Another way to view an order 3 tensor is to treat it as containing D channels of matrices. Every channel is a matrix of size H × W. The first channel contains all the numbers in the tensor that are indexed by (i, j, 0). When D = 1, an order 3 tensor reduces to a matrix.
We interact with tensors every day. A scalar value is a zeroth-order (order 0) tensor; a vector is an order 1 tensor; and a matrix is a second order tensor. A color image is in fact an order 3 tensor. An image with H rows and W columns is a tensor with size H × W × 3: if a color image is stored in the RGB format, it has 3 channels (for R, G and B, respectively), and each channel is an H × W matrix (second order tensor) that contains the R (or G, or B) values of all pixels.
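As a quick illustration, here is a small sketch using NumPy (an assumption of this example; the note itself does not use NumPy, and the image contents are made up). It stores a color image as an H × W × 3 tensor and accesses one channel as an H × W matrix:

```python
import numpy as np

# a made-up 4 x 6 RGB image, stored as an order 3 tensor of size H x W x 3
H, W = 4, 6
image = np.zeros((H, W, 3), dtype=np.uint8)
image[:, :, 0] = 255          # fill the R channel

red_channel = image[:, :, 0]  # each channel is an H x W matrix
print(image.shape)            # (4, 6, 3)
print(red_channel.shape)      # (4, 6)
```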
It is beneficial to represent images (or other types of raw data) as tensors. In early computer vision and pattern recognition, a color image (which is an order 3 tensor) was often converted to its gray-scale version (which is a matrix), because we knew how to handle matrices much better than tensors. The color information is lost in this conversion. But color is very important in various image (or video) based learning and recognition problems, and we do want to process color information in a principled way, e.g., as in CNN.
Tensors are essential in CNN. The input, intermediate representation, and
parameters in a CNN are all tensors. Tensors with order higher than 3 are
also widely used in a CNN. For example, we will soon see that the convolution
kernels in a convolution layer of a CNN form an order 4 tensor.
Given a tensor, we can arrange all the numbers inside it into a long vec-
tor, following a pre-specified order. For example, in Matlab, the (:) operator
converts a matrix into a column vector in the column-first order. An example
is:
A = [ 1  2
      3  4 ] ,    A(:) = (1, 3, 2, 4)^T .    (1)
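The same column-first vectorization can be reproduced in NumPy (an assumption of this sketch; Matlab's column-first order is called order 'F' there):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

# Matlab's A(:) flattens in column-first order; NumPy calls this order='F'
vec_A = A.flatten(order='F')
print(vec_A)  # [1 3 2 4]
```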
In mathematics, we use the notation “vec” to represent this vectorization operator; that is, vec(A) = (1, 3, 2, 4)^T in the example in Equation 1. To vectorize an order 3 tensor, we can vectorize its first channel (which is a matrix, and we already know how to vectorize it), then the second channel, . . . , until all channels are vectorized. The vectorization of the order 3 tensor is then the concatenation of the vectorizations of all the channels, in this order.
The vectorization of an order 3 tensor is a recursive process, which utilizes
the vectorization of order 2 tensors. This recursive process can be applied to
vectorize an order 4 (or even higher order) tensor in the same manner.
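A minimal NumPy sketch of this recursive definition (NumPy and the toy tensor sizes are assumptions of this example):

```python
import numpy as np

# a toy order 3 tensor with H = 2, W = 3, D = 2
x = np.arange(12).reshape(2, 3, 2)

# vectorize channel by channel, each channel in column-first order ...
vec = np.concatenate([x[:, :, d].flatten(order='F') for d in range(x.shape[2])])

# ... which matches flattening the whole tensor with the first index fastest
assert np.array_equal(vec, x.flatten(order='F'))
```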
2.2 Vector calculus and the chain rule
The CNN learning process depends on vector calculus and the chain rule. Suppose z is a scalar (i.e., z ∈ R) and y ∈ R^H is a vector. If z is a function of y, then the partial derivative of z with respect to y is a vector, defined as

[∂z/∂y]_i = ∂z/∂y_i .    (2)

In other words, ∂z/∂y is a vector having the same size as y, and its i-th element is ∂z/∂y_i. Also note that ∂z/∂y^T = (∂z/∂y)^T.
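Equation 2 can be checked numerically. In this sketch (NumPy assumed; z = y^T y is a made-up example function), each partial derivative ∂z/∂y_i is compared against a central finite difference:

```python
import numpy as np

# toy example: z = y^T y, whose gradient dz/dy has i-th element 2 * y_i
y = np.array([1.0, -2.0, 0.5])
grad = 2 * y

# check every partial derivative dz/dy_i with a central finite difference
eps = 1e-6
for i in range(len(y)):
    e = np.zeros_like(y)
    e[i] = eps
    numeric = (np.dot(y + e, y + e) - np.dot(y - e, y - e)) / (2 * eps)
    assert abs(numeric - grad[i]) < 1e-6
```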
Furthermore, suppose x ∈ R^W is another vector, and y is a function of x. Then, the partial derivative of y with respect to x is defined as

[∂y/∂x^T]_{ij} = ∂y_i/∂x_j .    (3)

This partial derivative is an H × W matrix, whose entry at the intersection of the i-th row and j-th column is ∂y_i/∂x_j.
It is easy to see that z is a function of x in a chain-like argument: a function

spatial location until we have moved the kernel to the bottom right corner of the input image, as shown in Figure 3.
For order 3 tensors, the convolution operation is defined similarly. Suppose the input in the l-th layer is an order 3 tensor of size H^l × W^l × D^l. A convolution kernel is also an order 3 tensor, of size H × W × D^l. When we overlap the kernel on top of the input tensor at the spatial location (0, 0, 0), we compute the products of corresponding elements in all the D^l channels and sum the H W D^l products to get the convolution result at this spatial location. Then, we move the kernel from top to bottom and from left to right to complete the convolution.
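The order 3 convolution just described can be sketched directly (NumPy assumed; `conv3` is a hypothetical helper implementing stride 1 with no padding, and the input/kernel sizes are toy values):

```python
import numpy as np

def conv3(x, k):
    """Valid convolution (stride 1, no padding) of an order 3 input x
    (Hl x Wl x Dl) with one kernel k (H x W x Dl)."""
    Hl, Wl, Dl = x.shape
    H, W, _ = k.shape
    out = np.zeros((Hl - H + 1, Wl - W + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # multiply corresponding elements in all Dl channels and sum
            out[i, j] = np.sum(x[i:i + H, j:j + W, :] * k)
    return out

x = np.ones((4, 4, 3))    # toy input: Hl = Wl = 4, Dl = 3
k = np.ones((2, 2, 3))    # toy kernel: H = W = 2
print(conv3(x, k).shape)  # (3, 3); each entry is 2 * 2 * 3 = 12
```

Note how the spatial extent of the output (3 × 3) is smaller than that of the input, exactly as discussed later for kernels larger than 1 × 1.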
In a convolution layer, multiple convolution kernels are usually used. Assuming D kernels are used and each kernel has spatial span H × W, we denote all the kernels as f. f is an order 4 tensor in R^{H×W×D^l×D}. Similarly, we use

Figure 4: The Lenna image and the effect of different convolution kernels: (a) Lenna; (b) horizontal edge; (c) vertical edge.
only at horizontal or vertical edges in certain directions. If we replace the Sobel kernel with other kernels (e.g., those learned by SGD), we can learn features that activate for edges with different angles. When we move further down the deep network, subsequent layers can learn to activate only for specific (but more complex) patterns, e.g., groups of edges that form a particular shape. These more complex patterns will be further assembled by deeper layers to activate for semantically meaningful object parts or even a particular type of object, e.g., dog, cat, tree, beach, etc.
One more benefit of the convolution layer is that all spatial locations share
the same convolution kernel, which greatly reduces the number of parameters
needed for a convolution layer. For example, if multiple dogs appear in an input
image, the same “dog-head-like pattern” feature will be activated at multiple
locations, corresponding to heads of different dogs.
In a deep neural network setup, convolution also encourages parameter sharing. For example, suppose “dog-head-like pattern” and “cat-head-like pattern” are two features learned by a deep convolutional network. The CNN does not need to devote two disjoint sets of parameters (e.g., convolution kernels in multiple layers) to them. The CNN’s bottom layers can learn “eye-like pattern” and “animal-fur-texture pattern”, which are shared by both these more abstract

product between φ(x^l)^T (the im2col expansion) and ∂z/∂Y (the supervision signal transferred from the (l+1)-th layer).
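To make the im2col expansion concrete, here is a minimal sketch (NumPy assumed; the `im2col` helper and the toy sizes are illustrative, and the element ordering inside each row is NumPy's default rather than the note's convention, which does not affect the product). It verifies that convolution with one kernel equals a matrix-vector product with the expanded input:

```python
import numpy as np

def im2col(x, H, W):
    """Expand x (Hl x Wl x Dl) into a matrix with one row per output
    spatial location; each row is a vectorized H x W x Dl patch."""
    Hl, Wl, Dl = x.shape
    rows = []
    for i in range(Hl - H + 1):
        for j in range(Wl - W + 1):
            rows.append(x[i:i + H, j:j + W, :].reshape(-1))
    return np.array(rows)

rng = np.random.default_rng(0)
x = rng.random((5, 5, 2))       # toy input
kernel = rng.random((3, 3, 2))  # toy kernel

# convolution with one kernel becomes a matrix-vector product
y_via_im2col = im2col(x, 3, 3) @ kernel.reshape(-1)

# direct sliding-window convolution for comparison
y_direct = np.array([np.sum(x[i:i + 3, j:j + 3, :] * kernel)
                     for i in range(3) for j in range(3)])
assert np.allclose(y_via_im2col, y_direct)
```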
6.6 Even higher dimensional indicator matrices
The function φ(·) has been very useful in our analysis. It is pretty high dimensional, e.g., φ(x^l) has H^{l+1} W^{l+1} H W D^l elements. From the above, we know that an element in φ(x^l) is indexed by a pair p and q.

A quick recap about φ(x^l): 1) from q we can determine d^l, which channel of the convolution kernel is used, and can also determine i and j, the spatial offsets inside the kernel; 2) from p we can determine i^{l+1} and j^{l+1}, the spatial offsets inside the convolved result x^{l+1}; and 3) the spatial offsets in the input x^l can be determined as i^l = i^{l+1} + i and j^l = j^{l+1} + j.

That is, the mapping m : (p, q) ↦ (i^l, j^l, d^l) is one-to-one, and thus is a valid function. The inverse mapping, however, is one-to-many (thus not a valid function). If we use m^{-1} to represent
features. In short, the combination of convolution kernels and deep, hierarchical structures is very effective in learning good representations (features) from images for visual recognition tasks.

We want to add a note here. Although we have used phrases such as “dog-head-like pattern”, the representation or feature learned by a CNN may not correspond exactly to a semantic concept such as “dog’s head”. A CNN feature may activate frequently for dogs’ heads and often be deactivated for other types of patterns. However, there can also be false activations at other locations, and deactivations at dogs’ heads.

In fact, a key concept in CNN (or more generally deep learning) is distributed representation. For example, suppose our task is to recognize N different types of objects and a CNN extracts M
index variables 0 ≤ i < H, 0 ≤ j < W, 0 ≤ d^l < D^l and 0 ≤ d < D to pinpoint a specific element in the kernels. Also note that the set of kernels f refers to the same object as the notation w^l in Equation 5. We change the notation a bit to make the derivation a little simpler. It is also clear that even if the mini-batch strategy is used, the kernels remain unchanged.

As shown in Figure 3, the spatial extent of the output is smaller than that of the input so long as the convolution kernel is larger than 1 × 1. Sometimes we need the input and output images to have the same height and width, and a simple padding trick can be used. If the input is H^l × W^l × D^l and the kernel size is H × W × D^l × D, the convolution
maps x to y, and another function maps y to z. The chain rule can then be used to compute ∂z/∂x^T; see Equation 4.

The ReLU layer sets the gradient of some features in the
l-th layer to 0, but these features are not activated (i.e., we are not interested in them). For those activated features, the gradient is back-propagated without any change, which is beneficial for SGD learning. The introduction of ReLU to replace the sigmoid was an important change in CNNs, which significantly reduced the difficulty of learning CNN parameters and improved accuracy. There are also more complex variants of ReLU, for example, the parametric ReLU and the exponential linear unit.
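The ReLU behavior described above can be sketched as follows (NumPy assumed; the function names and the toy gradient are made up for illustration):

```python
import numpy as np

def relu_forward(x):
    return np.maximum(x, 0.0)

def relu_backward(x, grad_out):
    # pass the gradient through unchanged where x > 0; set it to 0 elsewhere
    return grad_out * (x > 0)

x = np.array([-1.0, 2.0, -0.5, 3.0])
g = np.ones_like(x)           # pretend gradient from the (l+1)-th layer
print(relu_forward(x))        # [0. 2. 0. 3.]
print(relu_backward(x, g))    # [0. 1. 0. 1.]
```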
6 The convolution layer
Next, we turn to the convolution layer, which is the most involved one among
those we discuss in this note.
6.1 What is convolution?
Let us start by convolving a matrix
∂z/∂x^T = (∂z/∂y^T) (∂y/∂x^T) .    (4)
A sanity check for Equation 4 is to check the matrix / vector dimensions. Note that ∂z/∂y^T is a row vector with H elements, or a 1 × H matrix. (Be reminded that ∂z/∂y is a column vector.) Since ∂y/∂x^T is an H × W matrix, the vector / matrix multiplication between them is valid, and the result should be a row vector with W elements, which matches the dimensionality of ∂z/∂x^T.
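This dimension check can be mirrored in code (NumPy assumed; the sizes H = 3, W = 5 and the all-ones entries are arbitrary):

```python
import numpy as np

H, W = 3, 5
dz_dyT = np.ones((1, H))   # 1 x H row vector: (dz/dy)^T
dy_dxT = np.ones((H, W))   # H x W matrix: dy/dx^T

dz_dxT = dz_dyT @ dy_dxT   # the chain rule as a matrix product
print(dz_dxT.shape)        # (1, 5): a row vector with W elements
```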
ginners is still needed. Research papers are usually very terse and lack details.
It might be difficult for beginners to read such papers. A tutorial targeting
experienced researchers may not cover all the necessary details to understand
how a CNN runs.
This note tries to present a document that

- is self-contained. It is expected that all required mathematical background knowledge is introduced in this note itself (or in other notes for this course);

- has details for all the derivations. This note tries to explain all the necessary math in detail, and we try not to skip any important step in a derivation. Thus, it should be possible for a beginner to follow (although an expert may find this note tautological);

- ignores implementation details. The purpose is for a reader to understand how a CNN runs at the mathematical level, so we will ignore implementation details. In CNN, making correct choices for various implementation details is one of the keys to its high accuracy (that is, “the devil is in the details”). However, we intentionally leave this part out so that the reader can focus on the mathematics. After understanding the mathematical principles and details, it is more advantageous to learn these implementation and design details with hands-on experience by playing with CNN programming.