Generally speaking, when we want to “render” something to the screen, it means we need to calculate the “color” of each pixel on the refreshing LCD in front of us, and then ask some hardware to fill those pixels with the colors we specified. But typically what we have on our hard disk are just some mesh and texture data files, which only contain basic information about the objects we want to render: how many vertices they have, where each vertex’s position is in its own coordinate space, what the “color” of that vertex should look like, and so on. So, how do we manipulate them until we get the final color that we can output to the screen’s pixels? Of course, we need some math. I assume you have already learned some basic linear algebra and thus understand what a matrix is (or you could reference Wikipedia here for an overview).

First of all, let’s review some theory. A 4x4 matrix in Row-major mathematical convention is represented as (let’s count from 0 for convenience; also, **read** the row first, then **read** the column):

$\begin{bmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33} \\
\end{bmatrix}$

In this convention, the right side ($y$ in $a_{xy}$) of the index (or “subscript”) of an element is the one that is **counted** first, and as you can see, “row-major” here means the elements are **counted** along a row first. This convention is widely used around the world, but not everyone adopts it; I’ll give an opposite example later.

And since we are focusing on game development, typically we just use a matrix to do some fairly limited linear transformations on a vector (or multiply different matrices one by one to get a combined transformation matrix). And in most cases the vector we care about is just a 3-dimensional vector. But for cache-friendliness (have a look at Wikipedia; the common approach is to add padding data so the alignment is a power of 2) and to utilize the power of ~~Homo Sapiens coordinates~~ (Homogeneous coordinates), we tend to use a 4-dimensional vector instead, and it brings us the convenience of representing either a point or a direction by just modifying one of the components (set the additional w component to 1.0 for a point, 0.0 for a direction).
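The point-versus-direction trick can be sketched in a few lines of C++; the names here (`Vec4`, `makePoint`, etc.) are illustrative, not from any particular math library:

```cpp
#include <cassert>

// A minimal vector4 sketch to show the role of the w component.
struct Vec4 {
    float x, y, z, w;
};

// A point: w = 1.0, so translation affects it.
Vec4 makePoint(float x, float y, float z) { return {x, y, z, 1.0f}; }

// A direction: w = 0.0, so translation has no effect on it.
Vec4 makeDirection(float x, float y, float z) { return {x, y, z, 0.0f}; }

// Apply only the translation part (Tx, Ty, Tz): each component picks up
// T * w, so a direction (w == 0) is left untouched.
Vec4 translate(const Vec4& v, float tx, float ty, float tz) {
    return {v.x + tx * v.w, v.y + ty * v.w, v.z + tz * v.w, v.w};
}
```

Translating a point moves it; translating a direction leaves it unchanged, exactly because of that single w component.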

Then we can define the vector4 in Column-major mathematical convention as:

$\begin{bmatrix}
x \\
y \\
z \\
w \\
\end{bmatrix}$

As you can see, after all, it’s just a matrix with only one column; “column-major” here means the elements are listed as a 4x1 matrix instead of a 1x4 matrix.

Typically, when we do the matrix-vector multiplication with these two mathematical representations, we call it **right**/**post**-multiplication, indicating that the vector is on the **right** side, or in the **post** position, of the matrix in the equation (restricted by the definition of matrix multiplication, we need to match the dimensions of the two multiplicands). And as the demonstration below shows, if we calculate the result’s elements one by one, following the order of the elements in the vector4, we need to access each row of the matrix first, then each element in that row.

4x4 matrix * vector4 (4x1 matrix) :

$\begin{bmatrix}
x^{'} = a_{00} * x + a_{01} * y + a_{02} * z + a_{03} * w \\
y^{'} = a_{10} * x + a_{11} * y + a_{12} * z + a_{13} * w \\
z^{'} = a_{20} * x + a_{21} * y + a_{22} * z + a_{23} * w \\
w^{'} = a_{30} * x + a_{31} * y + a_{32} * z + a_{33} * w \\
\end{bmatrix}$
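A minimal C++ sketch of this post-multiplication, assuming a plain `float[4][4]` that stores the Row-major-convention matrix directly (illustrative names, not from any real library):

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };

// matrix * column vector: each output component walks one whole row of m.
// In a C/C++ 2D array a row is contiguous in memory, which is why this
// access pattern pairs well with the Row-major memory layout discussed below.
Vec4 mulMat4Vec4(const float m[4][4], const Vec4& v) {
    return {
        m[0][0] * v.x + m[0][1] * v.y + m[0][2] * v.z + m[0][3] * v.w,
        m[1][0] * v.x + m[1][1] * v.y + m[1][2] * v.z + m[1][3] * v.w,
        m[2][0] * v.x + m[2][1] * v.y + m[2][2] * v.z + m[2][3] * v.w,
        m[3][0] * v.x + m[3][1] * v.y + m[3][2] * v.z + m[3][3] * v.w,
    };
}
```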

If you implement the theory we just talked about in C/C++, it will be quite cache-friendly with a Row-major memory layout. (And for Fortran/Matlab, it’s cache-friendly with a Column-major memory layout.)

Why? Here again we introduced another term out of nowhere: Row-major memory layout. What is it? As different software frameworks and people have different definitions for it, I will follow one definition based on some nice blogs as my own. For example, I use C++ most of the time to work out some 01010101 projects, and if I use a *2D array* like `m[A][B]` in C++ to represent and store an $A \times B$ matrix, the memory will always be laid out as `m[0][0]` to `m[0][B - 1]` contiguously, then `m[1][0]` to `m[1][B - 1]`, and so on until `m[A - 1][0]` to the last element `m[A - 1][B - 1]`. The 2D array in memory looks like this (in C/C++; let’s use a quite small virtual address table, welcome to the 70’s???):

0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 | 0x0008 | 0x0009 | 0x000A | 0x000B | 0x000C | 0x000D | 0x000E | 0x000F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
m[0][0] | m[0][1] | m[0][2] | m[0][3] | m[1][0] | m[1][1] | m[1][2] | m[1][3] | m[2][0] | m[2][1] | m[2][2] | m[2][3] | m[3][0] | m[3][1] | m[3][2] | m[3][3] |
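This layout claim is easy to verify with pointer arithmetic; a quick sketch:

```cpp
#include <cassert>
#include <cstddef>

// In a C/C++ 2D array with B columns, element m[i][j] lives at flat offset
// i * B + j from the start: one whole row of B elements per step of i.
constexpr std::size_t flatOffset(std::size_t i, std::size_t j, std::size_t B) {
    return i * B + j;
}
```

Comparing `&m[i][j]` against `&m[0][0] + flatOffset(i, j, B)` for any indices shows they are the same address, which is exactly what “rows are contiguous” means.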

Because of this compiler interpretation order of the operator `[]`, some people say that C/C++ is a Row-major-memory-layout language. I would rather avoid this kind of understanding, since it conflicts with my own definition: you can achieve a Column-major memory layout in C/C++ too, if you just change your mind, indeed! I’ll explain a little bit later.

And if you want to assign the matrix data into this array, you can choose between two ways: set `m[x][y]` to the value of $a_{xy}$, or set `m[x][y]` to the value of $a_{yx}$ (remember we count from 0 everywhere). Naturally, if you want the index of the theoretical matrix to match the index of the implemented 2D array, you would choose the first way, and that’s how the term came about, in my opinion. If you set `m[x][y]` to the value of $a_{xy}$, then anybody iterating the elements of the array contiguously would access the data of the whole first row of the matrix, then the second row, until the last row; in other words, *row-major*.

But be careful! After we define Row-major memory layout, it doesn’t mean we have to choose only one way to assign the matrix data; we can still store the elements in any order we like. For example, as I already mentioned above, we could store a 4x4 matrix in Row-major mathematical convention in a C++ 2D array with Row-major memory layout like:

0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 | 0x0008 | 0x0009 | 0x000A | 0x000B | 0x000C | 0x000D | 0x000E | 0x000F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
m[0][0] | m[0][1] | m[0][2] | m[0][3] | m[1][0] | m[1][1] | m[1][2] | m[1][3] | m[2][0] | m[2][1] | m[2][2] | m[2][3] | m[3][0] | m[3][1] | m[3][2] | m[3][3] |
$a_{00}$ | $a_{01}$ | $a_{02}$ | $a_{03}$ | $a_{10}$ | $a_{11}$ | $a_{12}$ | $a_{13}$ | $a_{20}$ | $a_{21}$ | $a_{22}$ | $a_{23}$ | $a_{30}$ | $a_{31}$ | $a_{32}$ | $a_{33}$ |

or alternatively, we could store the columns of the matrix first, which means we choose a kind of Column-major memory layout in C++:

0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 | 0x0008 | 0x0009 | 0x000A | 0x000B | 0x000C | 0x000D | 0x000E | 0x000F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
m[0][0] | m[0][1] | m[0][2] | m[0][3] | m[1][0] | m[1][1] | m[1][2] | m[1][3] | m[2][0] | m[2][1] | m[2][2] | m[2][3] | m[3][0] | m[3][1] | m[3][2] | m[3][3] |
$a_{00}$ | $a_{10}$ | $a_{20}$ | $a_{30}$ | $a_{01}$ | $a_{11}$ | $a_{21}$ | $a_{31}$ | $a_{02}$ | $a_{12}$ | $a_{22}$ | $a_{32}$ | $a_{03}$ | $a_{13}$ | $a_{23}$ | $a_{33}$ |

As long as we are just discussing the convention itself, there is nothing to prevent us from picking a preferred way; actually, you could even store the matrix in memory in reverse order, or in some crazy encrypted random order. After all, it’s just a mapping between the matrix data and the memory locations. We really don’t need to worry about which convention to choose until the moment we start to write the code and build it into a binary executable in our project.

Similarly, we can now define a vector4 in Row-major mathematical convention as:

$\begin{bmatrix}
x & y & z & w \\
\end{bmatrix}$

it’s just a 1x4 matrix with only one row.
And from it we get the term **left**/**pre**-multiplication.

vector4 (1x4 matrix) * matrix4x4:

$\begin{bmatrix}
x^{'} = x * a_{00} + y * a_{10} + z * a_{20} + w * a_{30} \\
y^{'} = x * a_{01} + y * a_{11} + z * a_{21} + w * a_{31} \\
z^{'} = x * a_{02} + y * a_{12} + z * a_{22} + w * a_{32} \\
w^{'} = x * a_{03}+ y * a_{13} + z * a_{23} + w * a_{33} \\
\end{bmatrix}$
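Here is a sketch of this pre-multiplication, with the matrix stored in Column-major memory layout so that `m[c]` holds mathematical column $c$ (that is, `m[c][r]` is $a_{rc}$); each output component then reads one contiguous `m[c]`. Names are illustrative only:

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };

// Dot product of the vector with one stored column.
float dot4(const float col[4], const Vec4& v) {
    return col[0] * v.x + col[1] * v.y + col[2] * v.z + col[3] * v.w;
}

// row vector * matrix: x' = x*a00 + y*a10 + z*a20 + w*a30 is exactly
// dot(v, column 0), which is the contiguous array m[0] in this layout.
Vec4 mulVec4Mat4(const Vec4& v, const float m[4][4]) {
    return { dot4(m[0], v), dot4(m[1], v), dot4(m[2], v), dot4(m[3], v) };
}
```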

Again, there is no doubt that we need to access the elements in each column of the matrix contiguously, so the more efficient memory layout is:

0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 | 0x0008 | 0x0009 | 0x000A | 0x000B | 0x000C | 0x000D | 0x000E | 0x000F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
m[0][0] | m[0][1] | m[0][2] | m[0][3] | m[1][0] | m[1][1] | m[1][2] | m[1][3] | m[2][0] | m[2][1] | m[2][2] | m[2][3] | m[3][0] | m[3][1] | m[3][2] | m[3][3] |
$a_{00}$ | $a_{10}$ | $a_{20}$ | $a_{30}$ | $a_{01}$ | $a_{11}$ | $a_{21}$ | $a_{31}$ | $a_{02}$ | $a_{12}$ | $a_{22}$ | $a_{32}$ | $a_{03}$ | $a_{13}$ | $a_{23}$ | $a_{33}$ |

And after investigating some programming languages we reach the conclusion: for this case, C/C++ is cache-friendly with a Column-major memory layout. (And Fortran/Matlab is cache-friendly with a Row-major memory layout, but who cares about cache-friendliness in Matlab??? Come on.)

So finally the vector4 mathematical convention and the memory layout have 2 x 2 = 4 different combinations:

- vector4 in Column-major mathematical convention + Column-major memory layout
- vector4 in Column-major mathematical convention + Row-major memory layout
- vector4 in Row-major mathematical convention + Row-major memory layout
- vector4 in Row-major mathematical convention + Column-major memory layout

And practically speaking, w.r.t. the mapping between the index of the array and the index of the element in the matrix, whichever memory layout you choose, it’s better to choose the corresponding cache-friendly vector4 mathematical convention at the same time. That means we should avoid the mismatched combinations above (Column-major convention + Column-major layout, and Row-major convention + Row-major layout), because who wants stalls in their precious CPU cycles?

Let’s have a look at some examples for better comprehension:

If we want to extend a vector4 in Column-major mathematical convention into a translation matrix, we can write it on paper as:

$\begin{bmatrix}
T_x \\
T_y \\
T_z \\
T_w \\
\end{bmatrix}$

Then the matrix is:

$\begin{bmatrix}
1 & 0 & 0 & T_x \\
0 & 1 & 0 & T_y \\
0 & 0 & 1 & T_z \\
0 & 0 & 0 & T_w \\
\end{bmatrix}$

And let’s define a vector4 as the original point, before the translation:

$\begin{bmatrix}
x \\
y \\
z \\
w \\
\end{bmatrix}$

Then after the translation the result vector4 is:

$\begin{bmatrix}
x^{'} = 1 * x + 0 * y + 0 * z + T_x * w = x + T_x * w \\
y^{'} = 0 * x + 1 * y + 0 * z + T_y * w = y + T_y * w \\
z^{'} = 0 * x + 0 * y + 1 * z + T_z * w = z + T_z * w \\
w^{'} = 0 * x + 0 * y + 0 * z + T_w * w = T_w * w \\
\end{bmatrix}$

Because the $w$ component represents the length of the projection line in homogeneous/projective coordinates, let’s assume we’ve already done the Perspective Division, so $w = 1$ (and let $T_w = 1$ too); then the previous calculation can be simplified as:

Translation vector4:

$\begin{bmatrix}
T_x \\
T_y \\
T_z \\
1 \\
\end{bmatrix}$

Translation matrix:

$\begin{bmatrix}
1 & 0 & 0 & T_x \\
0 & 1 & 0 & T_y \\
0 & 0 & 1 & T_z \\
0 & 0 & 0 & 1 \\
\end{bmatrix}$

The original point:

$\begin{bmatrix}
x \\
y \\
z \\
1 \\
\end{bmatrix}$

The result point:

$\begin{bmatrix}
x^{'} = 1 * x + 0 * y + 0 * z + T_x * 1 = x + T_x \\
y^{'} = 0 * x + 1 * y + 0 * z + T_y * 1 = y + T_y \\
z^{'} = 0 * x + 0 * y + 1 * z + T_z * 1 = z + T_z \\
w^{'} = 0 * x + 0 * y + 0 * z + 1 * 1 = 1 \\
\end{bmatrix}$
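The simplified derivation above maps directly to code. A sketch with the matrix in Row-major memory layout, so `m[row][col]` is exactly $a_{row\,col}$ (the `Mat4`/`makeTranslation` names are hypothetical):

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };

struct Mat4 { float m[4][4] = {}; }; // zero-initialized

// Build the translation matrix from the derivation: identity diagonal,
// with Tx, Ty, Tz in the last column (a_{03}, a_{13}, a_{23}).
Mat4 makeTranslation(float tx, float ty, float tz) {
    Mat4 r;
    r.m[0][0] = r.m[1][1] = r.m[2][2] = r.m[3][3] = 1.0f;
    r.m[0][3] = tx;
    r.m[1][3] = ty;
    r.m[2][3] = tz;
    return r;
}

// Post-multiplication, row by row, matching the equations above.
Vec4 transform(const Mat4& a, const Vec4& v) {
    Vec4 r;
    r.x = a.m[0][0] * v.x + a.m[0][1] * v.y + a.m[0][2] * v.z + a.m[0][3] * v.w;
    r.y = a.m[1][0] * v.x + a.m[1][1] * v.y + a.m[1][2] * v.z + a.m[1][3] * v.w;
    r.z = a.m[2][0] * v.x + a.m[2][1] * v.y + a.m[2][2] * v.z + a.m[2][3] * v.w;
    r.w = a.m[3][0] * v.x + a.m[3][1] * v.y + a.m[3][2] * v.z + a.m[3][3] * v.w;
    return r;
}
```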

And again, of course, here we need to access each row first, so in C++ it’s more convenient to use the Row-major memory layout, because we **read** the 2D array’s indices the same way as the math convention’s indices and won’t have too many cache misses. But you could also use the Column-major memory layout; you’d just need to flip everything in your mind and lose some CPU cycles.

And in the same way, if we choose to use the vector4 in Row-major mathematical convention instead, we will find it’s more rational to use the Column-major memory layout, though we could still use the Row-major one.

But what if we want to use both the vector4 in Row-major mathematical convention and the vector4 in Column-major mathematical convention, and want the same cache-friendly performance in C++? How should the code be written? Actually, it’s not so complex; if you understood everything above thoroughly, you may figure it out by yourself. As we defined two mathematical conventions for the vector4 before, and since a vector is just a specialized matrix, the matrix mathematical convention can be row-major or column-major too; that’s why I emphasized “the 4x4 matrix in Row-major mathematical convention” at the very beginning of this article. We represent the 4x4 matrix in Column-major mathematical convention, flip everything in the theory we discussed before, then put it in the cache-friendly memory layout, and that’s all!

And below is what the mathematical representation looks like; remember we still **read** the row first, then **read** the column:

$\begin{bmatrix}
a_{00} & a_{10} & a_{20} & a_{30} \\
a_{01} & a_{11} & a_{21} & a_{31} \\
a_{02} & a_{12} & a_{22} & a_{32} \\
a_{03} & a_{13} & a_{23} & a_{33} \\
\end{bmatrix}$

And now the vector4 in Column-major mathematical convention is:

$\begin{bmatrix}
x = a_{00} & y = a_{10} & z = a_{20} & w = a_{30} \\
\end{bmatrix}$

And the translation matrix is:

$\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
T_x & T_y & T_z & 1 \\
\end{bmatrix}$

Now $T_x$, $T_y$, $T_z$ (and $T_w$) are $a_{03}$, $a_{13}$, $a_{23}$ (and $a_{33}$).

And finally we get the best choice as:

0x0000 | 0x0001 | 0x0002 | 0x0003 | 0x0004 | 0x0005 | 0x0006 | 0x0007 | 0x0008 | 0x0009 | 0x000A | 0x000B | 0x000C | 0x000D | 0x000E | 0x000F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
m[0][0] | m[0][1] | m[0][2] | m[0][3] | m[1][0] | m[1][1] | m[1][2] | m[1][3] | m[2][0] | m[2][1] | m[2][2] | m[2][3] | m[3][0] | m[3][1] | m[3][2] | m[3][3] |
$a_{00}$ | $a_{01}$ | $a_{02}$ | $a_{03}$ | $a_{10}$ | $a_{11}$ | $a_{12}$ | $a_{13}$ | $a_{20}$ | $a_{21}$ | $a_{22}$ | $a_{23}$ | $a_{30}$ | $a_{31}$ | $a_{32}$ | $a_{33}$ |

Actually, as you can see here, nothing changed in memory; what changed is our mathematical convention and our mind. That’s why I feel it’s “meaningless” to distinguish between “row-major” and “column-major” inside C++ itself, and this should be the reason why some people call C++ a “row-major” language. Still, we get a much clearer point of view after all this red tape, just before we start to write something.

But in practice, unfortunately, due to the historical reasons of OpenGL (see OpenGL FAQ 9.005), it uses the 4x4 matrix in Column-major mathematical convention with a memory layout corresponding to the Column-major memory layout in C/C++ (as the documentation says, “The translation components occupy the 13th, 14th, and 15th elements of the 16-element matrix”, meaning `m[3][0]` to `m[3][3]` are $T_x$, $T_y$, $T_z$ and $T_w$). Because of the different mathematical conventions, the glUniformMatrix4fv function has a parameter in the third position to indicate whether we want the vendor’s implementation to transpose the input matrix or not. If we use the same mathematical convention and memory layout as OpenGL initially defined, then we just set it to GL_FALSE, meaning no transposing. And if we choose the 4x4 matrix in Row-major mathematical convention but don’t change the memory layout in C++, we need to set it to GL_TRUE to transpose the input matrix, the same as the situation where we change the memory layout but don’t change the mathematical convention. If you are a messy guy like me, who chose not to use DirectXMath and wants to achieve a unified math library across different graphics APIs, you’d better be cautious about it.
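What the GL_TRUE flag effectively asks the driver to do is a plain transpose of the 16 floats before the GPU sees them. A sketch of that operation (no GL context involved, just the memory shuffle):

```cpp
#include <cassert>

// Flip a 16-float matrix across its diagonal: element (r, c) of the input
// becomes element (c, r) of the output. This is the transformation that
// glUniformMatrix4fv's transpose parameter requests when set to GL_TRUE.
void transpose4x4(const float in[16], float out[16]) {
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[c * 4 + r] = in[r * 4 + c];
}
```

After transposing, translation components stored at the 13th-15th elements (indices 12-14) end up at indices 3, 7, and 11, i.e. they move from the last "column" of floats to the last element of each row.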

Actually, it is a really confusing topic, because how we **read**, how we **count**, and how we use and understand the terms finally influences our implementation. In my game engine I decided to support both conventions, and so far it works fine, but in large-scale products I assume it would become a problem if people didn’t make the clarification clear at the very beginning.

Some code samples here:

```
//Column-major memory layout
#if defined (USE_COLUMN_MAJOR_MEMORY_LAYOUT)
template<class T>
auto toTranslationMatrix(const TVec4<T>& rhs) -> TMat4<T>
{
// @TODO: replace with SIMD impl
TMat4<T> l_m; // assuming TMat4's default constructor zero-initializes m
l_m.m[0][0] = one<T>;
l_m.m[1][1] = one<T>;
l_m.m[2][2] = one<T>;
l_m.m[3][0] = rhs.x;
l_m.m[3][1] = rhs.y;
l_m.m[3][2] = rhs.z;
l_m.m[3][3] = one<T>;
return l_m;
}
//Row-major memory layout
#elif defined (USE_ROW_MAJOR_MEMORY_LAYOUT)
template<class T>
auto toTranslationMatrix(const TVec4<T>& rhs) -> TMat4<T>
{
// @TODO: replace with SIMD impl
TMat4<T> l_m; // assuming TMat4's default constructor zero-initializes m
l_m.m[0][0] = one<T>;
l_m.m[0][3] = rhs.x;
l_m.m[1][1] = one<T>;
l_m.m[1][3] = rhs.y;
l_m.m[2][2] = one<T>;
l_m.m[2][3] = rhs.z;
l_m.m[3][3] = one<T>;
return l_m;
}
#endif
```

```
//Column-major memory layout
#if defined (USE_COLUMN_MAJOR_MEMORY_LAYOUT)
template<class T>
auto toRotationMatrix(const TVec4<T>& rhs) -> TMat4<T>
{
// @TODO: replace with SIMD impl
TMat4<T> l_m;
l_m.m[0][0] = (one<T> - two<T> * rhs.y * rhs.y - two<T> * rhs.z * rhs.z);
l_m.m[0][1] = (two<T> * rhs.x * rhs.y + two<T> * rhs.z * rhs.w);
l_m.m[0][2] = (two<T> * rhs.x * rhs.z - two<T> * rhs.y * rhs.w);
l_m.m[0][3] = (T());
l_m.m[1][0] = (two<T> * rhs.x * rhs.y - two<T> * rhs.z * rhs.w);
l_m.m[1][1] = (one<T> - two<T> * rhs.x * rhs.x - two<T> * rhs.z * rhs.z);
l_m.m[1][2] = (two<T> * rhs.y * rhs.z + two<T> * rhs.x * rhs.w);
l_m.m[1][3] = (T());
l_m.m[2][0] = (two<T> * rhs.x * rhs.z + two<T> * rhs.y * rhs.w);
l_m.m[2][1] = (two<T> * rhs.y * rhs.z - two<T> * rhs.x * rhs.w);
l_m.m[2][2] = (one<T> - two<T> * rhs.x * rhs.x - two<T> * rhs.y * rhs.y);
l_m.m[2][3] = (T());
l_m.m[3][0] = (T());
l_m.m[3][1] = (T());
l_m.m[3][2] = (T());
l_m.m[3][3] = (one<T>);
return l_m;
}
//Row-major memory layout
#elif defined (USE_ROW_MAJOR_MEMORY_LAYOUT)
template<class T>
auto toRotationMatrix(const TVec4<T>& rhs) -> TMat4<T>
{
// @TODO: replace with SIMD impl
TMat4<T> l_m;
l_m.m[0][0] = (one<T> - two<T> * rhs.y * rhs.y - two<T> * rhs.z * rhs.z);
l_m.m[0][1] = (two<T> * rhs.x * rhs.y - two<T> * rhs.z * rhs.w);
l_m.m[0][2] = (two<T> * rhs.x * rhs.z + two<T> * rhs.y * rhs.w);
l_m.m[0][3] = (T());
l_m.m[1][0] = (two<T> * rhs.x * rhs.y + two<T> * rhs.z * rhs.w);
l_m.m[1][1] = (one<T> - two<T> * rhs.x * rhs.x - two<T> * rhs.z * rhs.z);
l_m.m[1][2] = (two<T> * rhs.y * rhs.z - two<T> * rhs.x * rhs.w);
l_m.m[1][3] = (T());
l_m.m[2][0] = (two<T> * rhs.x * rhs.z - two<T> * rhs.y * rhs.w);
l_m.m[2][1] = (two<T> * rhs.y * rhs.z + two<T> * rhs.x * rhs.w);
l_m.m[2][2] = (one<T> - two<T> * rhs.x * rhs.x - two<T> * rhs.y * rhs.y);
l_m.m[2][3] = (T());
l_m.m[3][0] = (T());
l_m.m[3][1] = (T());
l_m.m[3][2] = (T());
l_m.m[3][3] = one<T>;
return l_m;
}
#endif
```

For the matrix multiplication algorithm, the naive implementation is $O(n^3)$, and there are many asymptotically faster algorithms (e.g. Strassen’s algorithm at $O(n^{2.807})$, the Coppersmith-Winograd algorithm at $O(n^{2.376})$, and the Stothers/Vassilevska Williams refinements at around $O(n^{2.373})$), which bring the time complexity down to around $O(n^{2.4})$. (Though for fixed-size 4x4 matrices, the constant factors make the naive version, or a SIMD one, the practical choice.)
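For reference, the naive $O(n^3)$ multiplication, specialized to the 4x4 case, is simple enough:

```cpp
#include <cassert>

// Naive 4x4 matrix multiplication: out = a * b, where out[r][c] is the dot
// product of row r of a with column c of b. Three nested loops, O(n^3).
void mul4x4(const float a[4][4], const float b[4][4], float out[4][4]) {
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a[r][k] * b[k][c]; // row of a dot column of b
            out[r][c] = sum;
        }
    }
}
```

Note that the inner loop strides down a column of `b`, which is exactly the kind of access pattern the memory-layout discussion above is about.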

For the transpose algorithm, we could consider in-place matrix transposition.

And for the inverse, Gaussian elimination is $O(n^3)$; we could choose LU decomposition instead, and there are also many optimized algorithms which I haven’t had time to investigate much.

Hope one day I’ll finally sit down and optimize my super slow matrix inverse operation!