# Row-major or Column-major? Some notes about matrix convention

## The mathematical conventions

Generally speaking, when we want to render something to screen that means we need to calculate the color of each pixel of the final screen in front of us, then ask some hardware to fulfill those pixels with the color we specified. But typically what we have in disk are just some mesh and texture data files, which contain some information about the object we want to render, like how many vertexes they have, where the vertex’s position is in their own coordinate space, what the color of that vertex should look like and etc.. So, how to manipulate them until we get final color of the screen’s pixel? Of course we need some math, I assume you had already learned some linear algebra in theory and could understand what a matrix it is (or you could reference Wikipedia).

First of all, let’s review some theoretical things, the 4×4 matrix in Row-major mathematical convention is represented as (let’s count from 0 for convenient, also read the row first, then read the column):

\begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ a_{30} & a_{31} & a_{32} & a_{33} \\ \end{bmatrix}

In this convention we assume the right side (y in a_{xy})of the index (or subscript) of the element is counted always at first, and you can see the Row-major here means the elements are counted in the row first. This convention is used widely but still not all of situations, I’ll explain later.

And since we are focusing on the game development, typically we just use matrix to do linear transformation with vector (or multiplate different matrices one by one to get another matrix), and in most cases it’s just a 3-dimensional vector, but for cache-friendly purpose (see Wikipedia, alternate approach is to use additional padding data to make the alignment is the power of 2) we’d intense to use 4-dimensional vector instead, and it could also give us the convenience to represent a point or a direction with just modifing one of the components (set the additional component to 1.0 as a point, 0.0 as a direction).

Then we could define the vector4 in Column-major mathematical convention is represented as:

\begin{bmatrix} x \\ y \\ z \\ w \\ \end{bmatrix}

As you can see, it’s just a matrix with only one column, the Column-major here means the elements are listed as a 4×1 matrix instead of the 1×4 matrix.

Typically when we want to do the matrix-vector multiplication with these two mathematical representations, we call it right/post-multiplication, means the vector is on the right side or in the post position of the matrix in the equation (restricted by the definition of matrix multiplication, we need to match the dimension between two multiplicators), and as it listed in the demonstration below, in theory, if we calculated the final result’s elements one by one following the order of elements in vector4, it needs to access each row of the matrix first then each element in the row.

4×4 matrix * vector4 (4×1 matrix) :

\begin{bmatrix} x^{‘} = a_{00} * x + a_{01} * y + a_{02} * z + a_{03} * w \\ y^{‘} = a_{10} * x + a_{11} * y + a_{12} * z + a_{13} * w \\ z^{‘} = a_{20} * x + a_{21} * y + a_{22} * z + a_{23} * w \\ w^{‘} = a_{30} * x + a_{31} * y + a_{32} * z + a_{33} * w \\ \end{bmatrix}

## In memory

If you implement it with C/C++, it’s cache-friendly with Row-major memory layout; for Fortan/Matlab, it’s cache-friendly with Column-major memory layout.

So here we introduced another term, Row-major memory layout, what is it? As different software frameworks and people have different definitions for it, I will choose to follow one definition based on some nice blog as my definition. For example, I use C++ at the most of times, and if I use 2D array like m[A][B] in C++ to store a matrix, the memory would always store the M[0][0] to m[0][B – 1] continuously then m[1][0] to m[1][B – 1] until m[A – 1][0] to m[A – 1][B – 1]. And if you just try to assign the matrix data into this array, you could find two ways to assign it, set m[x – 1][y – 1] to value of a_{xy} or set m[x – 1][y – 1] to value of a_{yx}, and that’s where the term came from. If you set m[x – 1][y – 1] to the value of a_{xy}, that means if anybody iterated the elements of the array continuously they would access the data of the whole row of a matrix then another one row until the last of rows.

Here the 2D array in memory looks like (in C/C++):

0x0000 0x0001 0x0002 0x0003 0x0004 0x0005 0x0006 0x0007 0x0008 0x0009 0x000A 0x000B 0x000C 0x000D 0x000E 0x000F
m[0][0] m[0][1] m[0][2] m[0][3] m[1][0] m[1][1] m[1][2] m[1][3] m[2][0] m[2][1] m[2][2] m[2][3] m[3][0] m[3][1] m[3][2] m[3][3]

Be careful here, after we defined Row-major memory layout, actually we still could store the matrix data in any kind of orders, for example we could just store a 4×4 matrix in Row-major mathematical convention in C++ 2D array with Row-major memory layout naturally like:

0x0000 0x0001 0x0002 0x0003 0x0004 0x0005 0x0006 0x0007 0x0008 0x0009 0x000A 0x000B 0x000C 0x000D 0x000E 0x000F
m[0][0] m[0][1] m[0][2] m[0][3] m[1][0] m[1][1] m[1][2] m[1][3] m[2][0] m[2][1] m[2][2] m[2][3] m[3][0] m[3][1] m[3][2] m[3][3]
a_{00} a_{01} a_{02} a_{03} a_{10} a_{11} a_{12} a_{13} a_{20} a_{21} a_{22} a_{23} a_{30} a_{31} a_{32} a_{33}

or store the column first, that means we choose to use Column-major memory layout in C++:

0x0000 0x0001 0x0002 0x0003 0x0004 0x0005 0x0006 0x0007 0x0008 0x0009 0x000A 0x000B 0x000C 0x000D 0x000E 0x000F
m[0][0] m[0][1] m[0][2] m[0][3] m[1][0] m[1][1] m[1][2] m[1][3] m[2][0] m[2][1] m[2][2] m[2][3] m[3][0] m[3][1] m[3][2] m[3][3]
a_{00} a_{10} a_{20} a_{30} a_{01} a_{11} a_{21} a_{31} a_{02} a_{12} a_{22} a_{32} a_{03} a_{13} a_{23} a_{33}

As far as we just discussed the convention itself until now, there is no problem to choose which way to store it, you could even store it in reverse order or some random order, that’s just a mapping relationship between the matrix data and the memory location, until we really write the code in our project we don’t need to worry about which one should we choose.

## Alternative approaches

Similar, we could define a vector4 in Row-major mathematical convention as:

\begin{bmatrix} x & y & z & w \\ \end{bmatrix}

it’s just a 1×4 matrix with only one row.
And then we get a term as left/pre-multiplication. And for C/C++, it’s cache-friendly with Column-major memory layout; for Fortan/Matlab, it’s cache-friendly with Row-major memory layout.

vector4 (1×4 matrix) * matrix4x4:

\begin{bmatrix} x^{‘} = x * a_{00} + y * a_{10} + z * a_{20} + w * a_{30} \\ y^{‘} = x * a_{01} + y * a_{11} + z * a_{21} + w * a_{31} \\ z^{‘} = x * a_{02} + y * a_{12} + z * a_{22} + w * a_{32} \\ w^{‘} = x * a_{03}+ y * a_{13} + z * a_{23} + w * a_{33} \\ \end{bmatrix}

Again, there is no doubt we need to access the elements in each column of matrix continously, so the best memory layout is:

0x0000 0x0001 0x0002 0x0003 0x0004 0x0005 0x0006 0x0007 0x0008 0x0009 0x000A 0x000B 0x000C 0x000D 0x000E 0x000F
m[0][0] m[0][1] m[0][2] m[0][3] m[1][0] m[1][1] m[1][2] m[1][3] m[2][0] m[2][1] m[2][2] m[2][3] m[3][0] m[3][1] m[3][2] m[3][3]
a_{00} a_{10} a_{20} a_{30} a_{01} a_{11} a_{21} a_{31} a_{02} a_{12} a_{22} a_{32} a_{03} a_{13} a_{23} a_{33}

So finally the vector4 mathematical convention and the memory layout have 2 x 2 = 4 different combinations:
vector4 in Column-major mathematical convention + Column-major memory layout
vector4 in Column-major mathematical convention + Row-major memory layout
vector4 in Row-major mathematical convention + Row-major memory layout
vector4 in Row-major mathematical convention + Column-major memory layout

In practice, because the mapping relationship of the index of the array and the index of the element in the matrix, whether you choose Row-major memory layout or Column-major memory layout, choose a corresponding cache-friendly vector4 mathematical convention would have the same efficiency, vice versa. And that means we need to avoid using the combinations I marked with different colors.

## Application examples

Let’s have a look at some examples:

If we want to represent a vector4 in Column-major mathematical convention to a translation matrix, we could write it in paper as:

\begin{bmatrix} T_x \\ T_y \\ T_z \\ T_w \\ \end{bmatrix}

Then the matrix is:

\begin{bmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & T_w \\ \end{bmatrix}

And let’s define a vector4 as the original point before the translation process,

\begin{bmatrix} x \\ y \\ z \\ w \\ \end{bmatrix}

Then after the translation the result vector4 is:

\begin{bmatrix} x^{‘} = 1 * x + 0 * y + 0 * z + T_x * w = x + T_x * w \\ y^{‘} = 0 * x + 1 * y + 0 * z + T_y * w = y + T_y * w \\ z^{‘} = 0 * x + 0 * y + 1 * z + T_z * w = z + T_z * w \\ w^{‘} = 0 * x + 0 * y + 0 * z + T_w * w = T_w * w \\ \end{bmatrix}

Because the {w} component represents the length of the projection line in homogeneous/projective coordinate, let’s assume we’ve already done the Perspective Division and then {w} = 1, the previous calculation could be simplified as:

Translation vector4:

\begin{bmatrix} T_x \\ T_y \\ T_z \\ 1 \\ \end{bmatrix}

Translation matrix:

\begin{bmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}

The original point:

\begin{bmatrix} x \\ y \\ z \\ 1 \\ \end{bmatrix}

The result point:

\begin{bmatrix} x^{‘} = 1 * x + 0 * y + 0 * z + T_x * 1 = x + T_x \\ y^{‘} = 0 * x + 1 * y + 0 * z + T_y * 1 = y + T_y \\ z^{‘} = 0 * x + 0 * y + 1 * z + T_z * 1 = z + T_z \\ w^{‘} = 0 * x + 0 * y + 0 * z + 1 * 1 = 1 \\ \end{bmatrix}

And again, of course in theory here we need to access each row first, then in C++ it’s more conveinent to use Row-major memory layout because we will read the 2D array’s index as same as the math convention’s index and won’t have too much cache-missing problem, but also you could use Column-major memory layout, just need to flip everything in your mind and lose some CPU cycles.

And the same, if we choose to use vector4 in Row-major mathematical convention instead, finally we would find it’s more rational to use Column-major memory layout, but still could use Row-major memory layout.

But what if we want to use both vector4 in Row-major mathematical convention and vector4 in Column-major mathematical convention and want to have the same cache-friendly performance in C++? How the code should be written? Actually, it’s not so complex, if you understood what I talked previously, you may figure it out by yourself. As we defined two mathematical conventions for vector4 before, since it is just a specialized matrix, we could think matrix mathematical convention could be Row-major and Column-major too, that’s why I emphasized the 4×4 matrix in Row-major mathematical convention at very beginning, we represent the 4×4 matrix in Column-major mathematical convention, and flip everything in theory we discussed before, then put them in the cache-friendly memory layout, that’s all!

And below is how the mathematical representation looks like, remember we still read the row first, then read the column:

\begin{bmatrix} a_{00} & a_{10} & a_{20} & a_{30} \\ a_{01} & a_{11} & a_{21} & a_{31} \\ a_{02} & a_{12} & a_{22} & a_{32} \\ a_{03} & a_{13} & a_{23} & a_{33} \\ \end{bmatrix}

And now the vector4 in Column-major mathematical convention is:

\begin{bmatrix} x = a_{00} & y = a_{10} & z = a_{20} & w= a_{30} \\ \end{bmatrix}

And the translation matrix is:

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ T_x & T_y & T_z & 1 \\ \end{bmatrix}

Now the {T_x}{T_y}{T_z} (and {T_w}) is a_{03}a_{13}a_{23}（And a_{33}

0x0000 0x0001 0x0002 0x0003 0x0004 0x0005 0x0006 0x0007 0x0008 0x0009 0x000A 0x000B 0x000C 0x000D 0x000E 0x000F
m[0][0] m[0][1] m[0][2] m[0][3] m[1][0] m[1][1] m[1][2] m[1][3] m[2][0] m[2][1] m[2][2] m[2][3] m[3][0] m[3][1] m[3][2] m[3][3]
a_{00} a_{01} a_{02} a_{03} a_{10} a_{11} a_{12} a_{13} a_{20} a_{21} a_{22} a_{23} a_{30} a_{31} a_{32} a_{33}

Actually as you can see here, nothing changed in memory, exactly what changed is our mathematical convention and our mind, and then I feel it’s “meaningless” to distinguish between “Row-major” or “Column-major” inside C++, and this should be the reason why some people call C++ as a “Row-major” language, but still after all these red tape we could have a more clearly point of view before we start to write something.

But in practice, due to the historical reason of OpenGL (see OpenGL FAQ 9.005), it uses 4×4 matrix in Column-major mathematical convention and a memory layout corresponding to Column-major memory layout in C++ (as the document said

The translation components occupy the 13th, 14th, and 15th elements of the 16-element matrix

, means m[3][0] to m[3][3] are {T_x}{T_y}{T_z} and {T_w}), and for the sake of different mathematical conventions, glUniformMatrix4fv function has a parameter at third position to indicate if we want to ask the implementation library to flip the input matrix or not. if we choose to use the same mathematical convention and memory layout as OpenGL used then we just set it to GL_FALSE, means no flipping. And if we choose to use 4×4 matrix in Row-major mathematical convention but didn’t change the memory layout in C++, we need to set it to GL_TRUE to flip the input matrix, same as the situation when just change the memory layout but didn’t change the mathematical convention.

Actually it is really a confused topic, because how we read and count how we use and understand the terms finally influence our implementation, in my project I decided to support both two different conventions and until now it works fine, but in large scale products I assume that would become a problem if didn’t make the clarification well and clearly.

## Code samples

Some code samples here:

//Column-major memory layout
#if defined (USE_COLUMN_MAJOR_MEMORY_LAYOUT)
template
auto TVec4::toTranslationMatrix(const TVec4& rhs) -> TMat4
{
// @TODO: replace with SIMD impl
TMat4 l_m;

l_m.m[0][0] = one;
l_m.m[1][1] = one;
l_m.m[2][2] = one;
l_m.m[3][0] = rhs.x;
l_m.m[3][1] = rhs.y;
l_m.m[3][2] = rhs.z;
l_m.m[3][3] = one;

return l_m;
}
//Row-major memory layout
#elif defined (USE_ROW_MAJOR_MEMORY_LAYOUT)
template
auto toTranslationMatrix(const TVec4& rhs) -> TMat4
{
// @TODO: replace with SIMD impl
TMat4 l_m;

l_m.m[0][0] = one;
l_m.m[0][3] = rhs.x;
l_m.m[1][1] = one;
l_m.m[1][3] = rhs.y;
l_m.m[2][2] = one;
l_m.m[2][3] = rhs.z;
l_m.m[3][3] = one;

return l_m;
}
#endif
//Column-major memory layout
#if defined (USE_COLUMN_MAJOR_MEMORY_LAYOUT)
template
auto toRotationMatrix(const TVec4& rhs) -> TMat4
{
// @TODO: replace with SIMD impl
TMat4 l_m;

l_m.m[0][0] = (one -two * rhs.y * rhs.y - two * rhs.z * rhs.z);
l_m.m[0][1] = (two * rhs.x * y + two * rhs.z * rhs.w);
l_m.m[0][2] = (two * rhs.x * rhs.z - two * rhs.y * rhs.w);
l_m.m[0][3] = (T());

l_m.m[1][0] = (two * rhs.x * rhs.y - two * rhs.z * rhs.w);
l_m.m[1][1] = (one -two * rhs.x * rhs.x - two * rhs.z * rhs.z);
l_m.m[1][2] = (two * rhs.y * rhs.z + two * rhs.x * rhs.w);
l_m.m[1][3] = (T());

l_m.m[2][0] = (two * rhs.x * rhs.z + two * rhs.y * rhs.w);
l_m.m[2][1] = (two * rhs.y * rhs.z - two * rhs.x * rhs.w);
l_m.m[2][2] = (one -two * rhs.x * rhs.x - two * rhs.y * rhs.y);
l_m.m[2][3] = (T());

l_m.m[3][0] = (T());
l_m.m[3][1] = (T());
l_m.m[3][2] = (T());
l_m.m[3][3] = (one);

return l_m;
}
//Row-major memory layout
#elif defined ( USE_ROW_MAJOR_MEMORY_LAYOUT)
template
auto toRotationMatrix(const TVec4& rhs) -> TMat4
{
// @TODO: replace with SIMD impl
TMat4 l_m;

l_m.m[0][0] = (one -two * rhs.y * rhs.y - two * rhs.z * rhs.z);
l_m.m[0][1] = (two * rhs.x * rhs.y - two * rhs.z * rhs.w);
l_m.m[0][2] = (two * rhs.x * rhs.z + two * rhs.y * rhs.w);
l_m.m[0][3] = (T());

l_m.m[1][0] = (two * rhs.x * rhs.y + two * rhs.z * rhs.w);
l_m.m[1][1] = (one -two * rhs.x * rhs.x - two * rhs.z * rhs.z);
l_m.m[1][2] = (two * rhs.y * rhs.z - two * rhs.x * rhs.w);
l_m.m[1][3] = (T());

l_m.m[2][0] = (two * rhs.x * rhs.z - two * rhs.y * rhs.w);
l_m.m[2][1] = (two * rhs.y * rhs.z + two * rhs.x * rhs.w);
l_m.m[2][2] = (one -two * rhs.x * rhs.x - two * rhs.y * rhs.y);
l_m.m[2][3] = (T());

l_m.m[3][0] = (T());
l_m.m[3][1] = (T());
l_m.m[3][2] = (T());
l_m.m[3][3] = one;

return l_m;
}
#endif

## Further more

For the Multiplication algorithm of matrix, the naive implemetations is O(n^3), and there are lots of optimized algorithms (e.g. Solvay Strassen algorithm O(n^{2.807})/Coppersmith-Winograd algorithm O(n^{2.3737})/Stochers-Williams algorithm O(n^{2.3729}) could get a time complexity around O(n^{2.4}).

For transpose could consider about in-situ matrix transposition.

And for inverse, Gaussian Elimination is O(n^3), could choose LU decomposition, and also there are lots of optimized algorithms.