Optional: Matrix norms
We defined $\mathrm{exp}(A)={\sum}_{n=0}^{\mathrm{\infty}}\frac{1}{n!}{A}^{n}$ , ${A}^{0}=I$ , and claimed that it converges. What does it mean for a sequence of matrices to converge?
A sequence ${A}_{k}$ of matrices converges to a matrix $A$ if, for all $\epsilon >0$ , there exists an $N$ such that $$\parallel {A}_{k}-A\parallel <\epsilon$$ whenever $k\ge N$ .
Here, $\parallel \cdot \parallel $ denotes a matrix norm, i.e. a way of measuring how "big" a matrix is.
A matrix norm is a function $\parallel \cdot \parallel :\U0001d524\U0001d529(n,\mathbf{R})\to [0,\mathrm{\infty})$ such that:

(triangle inequality) $\parallel A+B\parallel \le \parallel A\parallel +\parallel B\parallel $ for all $A,B\in \U0001d524\U0001d529(n,\mathbf{R})$ ;

$\parallel \lambda A\parallel =|\lambda |\parallel A\parallel $ for all $A\in \U0001d524\U0001d529(n,\mathbf{R})$ and $\lambda \in \mathbf{R}$ ;

$\parallel A\parallel =0$ if and only if $A=0$ .
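With this definition in hand, the claimed convergence of the exponential series can be sanity-checked numerically. The sketch below is not from the text; it assumes NumPy and measures the error with the maximum absolute entry of the difference (any matrix norm would do):

```python
import numpy as np

# Partial sums S_N = sum_{n=0}^N A^n / n! of the series defining exp(A).
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

S = np.eye(2)          # the n = 0 term, A^0 = I
term = np.eye(2)
for n in range(1, 30):
    term = term @ A / n        # term is now A^n / n!
    S = S + term

# Since A^2 = -I, exp(A) = cos(1) I + sin(1) A, a rotation by 1 radian.
expected = np.array([[np.cos(1.0), np.sin(1.0)],
                     [-np.sin(1.0), np.cos(1.0)]])
print(np.max(np.abs(S - expected)))   # tiny: the partial sums have converged
```

Thirty terms already agree with the limit to machine precision, reflecting the $1/n!$ decay of the terms.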
We will focus on two particular matrix norms.
${L}^{1}$ norm
The ${L}^{1}$ norm of $A$ is ${\parallel A\parallel}_{{L}^{1}}={\sum}_{i,j}|{A}_{ij}|$ . So a matrix is "big in the ${L}^{1}$ norm" if it has an entry with large absolute value.
If ${\parallel A\parallel}_{{L}^{1}}=0$ then ${A}_{ij}=0$ for all $i,j$ , so $A=0$ . If we rescale $A$ by $\lambda $ then the absolute values of all entries are scaled by $|\lambda |$ , so ${\parallel \lambda A\parallel}_{{L}^{1}}=|\lambda |{\parallel A\parallel}_{{L}^{1}}$ . The triangle inequality can be deduced by applying the triangle inequality for absolute values to each matrix entry.
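A short NumPy check of these three properties (a sketch, not from the text; the helper name `l1_norm` is our own):

```python
import numpy as np

def l1_norm(A):
    """The L^1 norm: the sum of the absolute values of the entries."""
    return np.sum(np.abs(A))

A = np.array([[1.0, -2.0], [3.0, 0.0]])
B = np.array([[0.5, 4.0], [-1.0, 2.0]])

print(l1_norm(A))   # 1 + 2 + 3 + 0 = 6.0
assert l1_norm(A + B) <= l1_norm(A) + l1_norm(B)        # triangle inequality
assert np.isclose(l1_norm(-3.0 * A), 3.0 * l1_norm(A))  # scaling by |lambda|
assert l1_norm(np.zeros((2, 2))) == 0.0                 # zero iff A = 0
```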
Operator norm
The operator norm of $A$ is $${\parallel A\parallel}_{op}=\mathrm{max}\{|Au|:u\in {\mathbf{R}}^{n},|u|=1\},$$ the largest factor by which $A$ stretches a unit vector.
Another way to think of this is: you take the unit sphere in ${\mathbf{R}}^{n}$ , you apply $A$ to obtain an ellipsoid, and you take the furthest distance of a point on this ellipsoid from the origin.
Take $A=\left(\begin{array}{cc}1&0\\ 0&2\end{array}\right)$ . The $x$ axis is fixed and the $y$ axis is rescaled by a factor of 2. The unit circle therefore becomes an ellipse with height 2 and width 1. The furthest point from the origin is a distance 2 from the origin (either the north or south pole), so ${\parallel A\parallel}_{op}=2$ .
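Numerically, the operator norm coincides with the largest singular value, so NumPy should report 2 for this example (a sketch, not from the text):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# np.linalg.norm with ord=2 returns the largest singular value,
# which equals the operator norm.
print(np.linalg.norm(A, 2))           # ≈ 2.0

# Direct check: apply A to many unit vectors and take the longest image.
thetas = np.linspace(0.0, 2.0 * np.pi, 10001)
circle = np.stack([np.cos(thetas), np.sin(thetas)])  # points on the unit circle
print(np.max(np.linalg.norm(A @ circle, axis=0)))    # ≈ 2.0 (north/south pole)
```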
To see that the operator norm is a norm, note that:

If ${\parallel A\parallel}_{op}=0$ then all points on the ellipsoid image of the unit sphere are a distance 0 from the origin, so the image of the unit sphere is the origin. Therefore $A=0$ .

If you rescale $A$ by $\lambda $ then the lengths of the vectors $Au$ over which we are taking the maximum are rescaled by $|\lambda |$ , so ${\parallel \lambda A\parallel}_{op}=|\lambda |{\parallel A\parallel}_{op}$ .

To prove the triangle inequality, note that for a unit vector $u$ , $|(A+B)u|=|Au+Bu|\le |Au|+|Bu|\le {\parallel A\parallel}_{op}+{\parallel B\parallel}_{op}$ . Since ${\parallel A+B\parallel}_{op}=\mathrm{max}\{|(A+B)u|:|u|=1\}$ , this shows ${\parallel A+B\parallel}_{op}\le {\parallel A\parallel}_{op}+{\parallel B\parallel}_{op}$ .
Lipschitz equivalence
Any two matrix norms on $\U0001d524\U0001d529(n,\mathbf{R})$ are Lipschitz equivalent. For the two norms we've met so far, this means there exist constants ${C}_{1}$ , ${C}_{2}$ , ${D}_{1}$ , ${D}_{2}$ (independent of $A$ ) such that: $${C}_{1}{\parallel A\parallel}_{{L}^{1}}\le {\parallel A\parallel}_{op}\le {C}_{2}{\parallel A\parallel}_{{L}^{1}}$$ and $${D}_{1}{\parallel A\parallel}_{op}\le {\parallel A\parallel}_{{L}^{1}}\le {D}_{2}{\parallel A\parallel}_{op}.$$
These inequalities will be useful in the proof of convergence: sometimes it's easier to bound one or other, and this is telling us that if you can bound one then you can bound the other. It also tells us that the notion of convergence we defined (where we had implicitly picked a matrix norm) doesn't depend on which norm we picked.
We won't prove the lemma, but for those who are interested, it's true more generally that any two norms on a finite-dimensional vector space are Lipschitz equivalent. This fails for infinite-dimensional vector spaces, but we're working with $\U0001d524\U0001d529(n,\mathbf{R})$ , which is ${n}^{2}$ -dimensional.
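The lemma is easy to test numerically. The concrete constants used below, ${\parallel A\parallel}_{op}\le {\parallel A\parallel}_{{L}^{1}}$ and ${\parallel A\parallel}_{{L}^{1}}\le {n}^{2}{\parallel A\parallel}_{op}$ , are one choice that works; they are not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
for _ in range(100):
    A = rng.standard_normal((n, n))
    l1 = np.sum(np.abs(A))          # L^1 norm
    op = np.linalg.norm(A, 2)       # operator norm (largest singular value)
    # ||A||_op <= ||A||_F <= ||A||_L1, and each |A_ij| <= ||A||_op,
    # which gives ||A||_L1 <= n^2 ||A||_op.
    assert op <= l1 + 1e-9
    assert l1 <= n**2 * op + 1e-9
```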
Properties of the operator norm
We will now prove some useful properties of the operator norm. Since we're only focusing on this norm, we will drop the subscript $op$ and write it as $\parallel A\parallel $ .

For any vector $v\in {\mathbf{R}}^{n}$ , $|Av|\le \parallel A\parallel |v|$ .

$\parallel AB\parallel \le \parallel A\parallel \parallel B\parallel $ .

$\parallel {A}^{m}\parallel \le {\parallel A\parallel}^{m}$ .

Write $v$ as $|v|u$ for some unit vector $u$ (for $v=0$ the inequality is immediate). Then $|Av|=|v|\,|Au|\le \parallel A\parallel |v|$ because $|Au|\le \parallel A\parallel $ by definition of the operator norm.

Let $u$ be a unit vector. We have $|ABu|\le \parallel A\parallel |Bu|\le \parallel A\parallel \parallel B\parallel |u|=\parallel A\parallel \parallel B\parallel $ using the first part of the lemma twice. This shows that the things we are maximising over to get $\parallel AB\parallel $ are all less than $\parallel A\parallel \parallel B\parallel $ , so $\parallel AB\parallel \le \parallel A\parallel \parallel B\parallel $ .

Lastly, $\parallel {A}^{m}\parallel \le \parallel {A}^{m-1}\parallel \parallel A\parallel \le \mathrm{\cdots}\le {\parallel A\parallel}^{m}$ , using the previous part of the lemma inductively.
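All three inequalities of the lemma can be sanity-checked numerically on random matrices (a sketch, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
v = rng.standard_normal(3)

def op(M):
    """Operator norm: the largest singular value of M."""
    return np.linalg.norm(M, 2)

assert np.linalg.norm(A @ v) <= op(A) * np.linalg.norm(v) + 1e-9  # |Av| <= ||A|| |v|
assert op(A @ B) <= op(A) * op(B) + 1e-9                          # ||AB|| <= ||A|| ||B||
m = 5
assert op(np.linalg.matrix_power(A, m)) <= op(A) ** m + 1e-9      # ||A^m|| <= ||A||^m
```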