Row-Major or Column-Major? Some notes about matrix convention

The mathematical conventions

Generally speaking, when we want to “render” something to the screen, it means we need to calculate the “color” of each pixel on the refreshing LCD in front of us, and then ask some hardware to fulfill those pixels with the color we specified. But typically what we have in our hard disk are just some mesh and texture data files, which only contain some basic information about the object we want to render, like how many vertexes they have, where the vertex’s position is in their own coordinate space, what the “color” of that vertex should look like and etc.. So, how to manipulate them until we get the final color that we could output to the screen’s pixel? Of course, we need some math, I assume you had already learned some basic linear algebra in theory and thus could understand what a matrix it is (or you could reference Wikipedia here for some sorts of overview).

First of all, let’s have a review about some theoretical things, the 4x4 matrix in Row-major mathematical convention is represented as (let’s count from 0 for convenient, also read the row first, then read the column):

\begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ a_{30} & a_{31} & a_{32} & a_{33} \\ \end{bmatrix}

In this convention, we assume the right side ( $y$ in $a_{xy}$ )of the index (or call it “subscript”) of the element is counted always at first, and as you can see the row-major here means the elements are counted in the row first. This convention is used widely among the world but still not all the people choose to adopt it, I’ll give an opposite example later.

And since we are focusing on the game development, typically we just use a matrix to do some fairly limited linear transformation with another vector (or multiplicate different matrices one by one to get a transformation matrix). And in most cases the vector we cared about is just a 3-dimensional vector. But for cache-friendly purpose (have a look at Wikipedia, the alternate approach is using the additional padding data to make the alignment is the power of 2) and to utilize the power of the ~~Homo Sapiens coordinates~~(Homogeneous coordinates), we’d intense to use 4-dimensional vector instead, and it would bring us the convenience to represent a point or a direction by just modifying one of the components (set the additional w component to 1.0 as a point, 0.0 as a direction).

Then we could define the vector4 in Column-major mathematical convention is represented as:

\begin{bmatrix} x \\ y \\ z \\ w \\ \end{bmatrix}

As you can see, after all, it’s just a matrix with only one column, the column-major here means the elements are listed as a 4x1 matrix instead of the 1x4 matrix.

Typically when we want to do the matrix-vector multiplication with these two mathematical representations, we name it as the right/post-multiplication, indicate that the vector is on the right side or in the post position of the matrix in the equation (restricted by the definition of matrix multiplication, we need to match the dimension between two multiplicators). And as what I listed in the demonstration below, if we calculated the final result’s elements one by one by following the order of elements in vector4, we need to access each row of the matrix first then each element in the row.

4x4 matrix * vector4 (4x1 matrix) :

\begin{bmatrix} x^{'} = a_{00} * x + a_{01} * y + a_{02} * z + a_{03} * w \\ y^{'} = a_{10} * x + a_{11} * y + a_{12} * z + a_{13} * w \\ z^{'} = a_{20} * x + a_{21} * y + a_{22} * z + a_{23} * w \\ w^{'} = a_{30} * x + a_{31} * y + a_{32} * z + a_{33} * w \\ \end{bmatrix}

In memory

If you implement the theory stuffs that we talk about with some C/C++, it would be quite cache-friendly with Row-major memory layout. (And for Fortan/Matlab, it’s cache-friendly with Column-major memory layout.)

Why? So here again we introduced another term came from no where, Row-major memory layout, what is it? As different software frameworks and people have different definitions for it, I will choose to follow one definition based on some nice blog as my definition. For example, I use C++ at the most of times to work out some 01010101 projects, and if I use 2D array like m[A][B] in C++ to represent and store a $A*B$ dimension matrix, the final memory would always laid as the m[0][0] to m[0][B - 1] continuously and then m[1][0] to m[1][B - 1] until m[A - 1][0] to the last element m[A - 1][B - 1].

The 2D array in memory looks like (in C/C++, let’s use a quite small virtual address table, welcome to 70’s???):

0x0000	0x0001	0x0002	0x0003	0x0004	0x0005	0x0006	0x0007	0x0008	0x0009	0x000A	0x000B	0x000C	0x000D	0x000E
m[0][0]	m[0][1]	m[0][2]	m[0][3]	m[1][0]	m[1][1]	m[1][2]	m[1][3]	m[2][0]	m[2][1]	m[2][2]	m[2][3]	m[3][0]	m[3][1]	m[3][2]

For this kind of compiler interpretation order of the operator [], some people would say, C/C++ is Row-major memory layout language. I would rather deprecate this kind of understanding since it would have some conflicts with my own definition because, you could achieve Column-major memory layout in C/C++ too if you just change your mind in deed! Let’s explain just a little bit later.

And if you want to assign the matrix data into this array, you could choose two ways to assign it: set m[x - 1][y - 1] to value of $a_{xy}$ or set m[x - 1][y - 1] to value of $a_{yx}$ . Naturally, if you want to match the index of the theoretical matrix and the implemented 2D array, you would choose the first way and that’s how the term came from in my opinion. If you set m[x - 1][y - 1] to the value of $a_{xy}$ , that means if anybody iterated the elements of the array continuously they would access the data of the first whole row of the matrix then the second row until the last of rows, in another word, row-major.

But be careful! After we defined Row-major memory layout, it doesn’t means we have to choose the only one way to assign the matrix data, we still could store them in any kind of orders we like. For example as I’ve already talked above, we could just store a 4x4 matrix in Row-major mathematical convention in C++ 2D array with Row-major memory layout like:

0x0000	0x0001	0x0002	0x0003	0x0004	0x0005	0x0006	0x0007	0x0008	0x0009	0x000A	0x000B	0x000C	0x000D	0x000E
m[0][0]	m[0][1]	m[0][2]	m[0][3]	m[1][0]	m[1][1]	m[1][2]	m[1][3]	m[2][0]	m[2][1]	m[2][2]	m[2][3]	m[3][0]	m[3][1]	m[3][2]
$a_{00}$	$a_{01}$	$a_{02}$	$a_{03}$	$a_{10}$	$a_{11}$	$a_{12}$	$a_{13}$	$a_{20}$	$a_{21}$	$a_{22}$	$a_{23}$	$a_{30}$	$a_{31}$	$a_{32}$

or alternatively, we could store the column of the matrix first, that means we choose to use a kind of Column-major memory layout in C++:

0x0000	0x0001	0x0002	0x0003	0x0004	0x0005	0x0006	0x0007	0x0008	0x0009	0x000A	0x000B	0x000C	0x000D	0x000E
m[0][0]	m[0][1]	m[0][2]	m[0][3]	m[1][0]	m[1][1]	m[1][2]	m[1][3]	m[2][0]	m[2][1]	m[2][2]	m[2][3]	m[3][0]	m[3][1]	m[3][2]
$a_{00}$	$a_{10}$	$a_{20}$	$a_{30}$	$a_{01}$	$a_{11}$	$a_{21}$	$a_{31}$	$a_{02}$	$a_{12}$	$a_{22}$	$a_{32}$	$a_{03}$	$a_{13}$	$a_{23}$

As far as we just discussed the convention itself until now, there is nothing ahead that would obstacle us to pick up a preferred way, and actually, you could even store the matrix in memory in reverse order or some crazy encrypted random order. After all, it’s just a mapping relationship between the matrix data and the memory location. We really don’t need to worry about which convention should we choose until the moment we start to write the code and build them into a binary executable file in our project.

Alternative approaches

Similar, we could define a vector4 in Row-major mathematical convention now as:

\begin{bmatrix} x & y & z & w \\ \end{bmatrix}

it’s just a 1x4 matrix with only one row. And then we could get a term as left/pre-multiplication.

vector4 (1x4 matrix) * matrix4x4:

\begin{bmatrix} x^{'} = x * a_{00} + y * a_{10} + z * a_{20} + w * a_{30} \\ y^{'} = x * a_{01} + y * a_{11} + z * a_{21} + w * a_{31} \\ z^{'} = x * a_{02} + y * a_{12} + z * a_{22} + w * a_{32} \\ w^{'} = x * a_{03}+ y * a_{13} + z * a_{23} + w * a_{33} \\ \end{bmatrix}

Again, there is no doubt we need to access the elements in each column of matrix continuously, so the more efficient memory layout is:

0x0000	0x0001	0x0002	0x0003	0x0004	0x0005	0x0006	0x0007	0x0008	0x0009	0x000A	0x000B	0x000C	0x000D	0x000E
m[0][0]	m[0][1]	m[0][2]	m[0][3]	m[1][0]	m[1][1]	m[1][2]	m[1][3]	m[2][0]	m[2][1]	m[2][2]	m[2][3]	m[3][0]	m[3][1]	m[3][2]
$a_{00}$	$a_{10}$	$a_{20}$	$a_{30}$	$a_{01}$	$a_{11}$	$a_{21}$	$a_{31}$	$a_{02}$	$a_{12}$	$a_{22}$	$a_{32}$	$a_{03}$	$a_{13}$	$a_{23}$

And after we investigate some programming languages we would get the conclusion: for C/C++, it’s cache-friendly with Column-major memory layout. (And for Fortan/Matlab, it’s cache-friendly with Row-major memory layout, but who cares about cache friendly in Matlab??? Come on.)

So finally the vector4 mathematical convention and the memory layout have 2 x 2 = 4 different combinations: vector4 in Column-major mathematical convention + Column-major memory layout vector4 in Column-major mathematical convention + Row-major memory layout vector4 in Row-major mathematical convention + Row-major memory layout vector4 in Row-major mathematical convention + Column-major memory layout

And practically speaking, w.r.t. the mapping relationship between the index of the array and the index of the element in the matrix, whether you choose Row-major memory layout or Column-major memory layout, it’s better to choose a corresponding cache-friendly vector4 mathematical convention at the same time. And that means we need to avoid using the combinations I marked with different colors above because who want some stalls in their precious CPU cycles?

Application examples

Let’s have a look at some examples for better comprehension:

If we want to extend a vector4 in Column-major mathematical convention to a translation matrix, we could write it in paper as:

\begin{bmatrix} T_x \\ T_y \\ T_z \\ T_w \\ \end{bmatrix}

Then the matrix is:

\begin{bmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & T_w \\ \end{bmatrix}

And let’s define a vector4 as the original point before the translation process,

\begin{bmatrix} x \\ y \\ z \\ w \\ \end{bmatrix}

Then after the translation the result vector4 is:

\begin{bmatrix} x^{'} = 1 * x + 0 * y + 0 * z + T_x * w = x + T_x * w \\ y^{'} = 0 * x + 1 * y + 0 * z + T_y * w = y + T_y * w \\ z^{'} = 0 * x + 0 * y + 1 * z + T_z * w = z + T_z * w \\ w^{'} = 0 * x + 0 * y + 0 * z + T_w * w = T_w * w \\ \end{bmatrix}

Because the ${w}$ component represents the length of the projection line in homogeneous/projective coordinate, let’s assume we’ve already done the Perspective Division and then ${w} = 1$ , the previous calculation could be simplified as:

Translation vector4:

\begin{bmatrix} T_x \\ T_y \\ T_z \\ 1 \\ \end{bmatrix}

Translation matrix:

\begin{bmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}

The original point:

\begin{bmatrix} x \\ y \\ z \\ 1 \\ \end{bmatrix}

The result point:

\begin{bmatrix} x^{'} = 1 * x + 0 * y + 0 * z + T_x * 1 = x + T_x \\ y^{'} = 0 * x + 1 * y + 0 * z + T_y * 1 = y + T_y \\ z^{'} = 0 * x + 0 * y + 1 * z + T_z * 1 = z + T_z \\ w^{'} = 0 * x + 0 * y + 0 * z + 1 * 1 = 1 \\ \end{bmatrix}

And again, of course here we need to access each row first, then in C++ it’s more convenient to use Row-major memory layout because we will read the 2D array’s index as same as the math convention’s index and won’t have too much cache-missing problem, but also you could use Column-major memory layout, just need to flip everything in your mind and lose some CPU cycles.

And the same, if we choose to use vector4 in Row-major mathematical convention instead, finally we would find it’s more rational to use Column-major memory layout, but still we could use Row-major memory layout.

But what if we want to use both vector4 in Row-major mathematical convention and vector4 in Column-major mathematical convention and want to have the same cache-friendly performance in C++? How the code should be written? Actually, it’s not so complex, if you understood what all I talked above thoroughly, you may figure it out by yourself. As we defined two mathematical conventions for vector4 before, since it is just a specialized matrix, we could think matrix mathematical convention could be row-major and column-major too, that’s why I emphasized the 4x4 matrix in Row-major mathematical convention at the very beginning of this article, we represent the 4x4 matrix in Column-major mathematical convention, and flip everything in theory we discussed before, then put them in the cache-friendly memory layout, that’s all!

And below is how the mathematical representation looks like, remember we still read the row first, then read the column:

\begin{bmatrix} a_{00} & a_{10} & a_{20} & a_{30} \\ a_{01} & a_{11} & a_{21} & a_{31} \\ a_{02} & a_{12} & a_{22} & a_{32} \\ a_{03} & a_{13} & a_{23} & a_{33} \\ \end{bmatrix}

And now the vector4 in Column-major mathematical convention is:

\begin{bmatrix} x = a_{00} & y = a_{10} & z = a_{20} & w= a_{30} \\ \end{bmatrix}

And the translation matrix is:

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ T_x & T_y & T_z & 1 \\ \end{bmatrix}

Now the ${T_x}$ 、 ${T_y}$ 、 ${T_z}$ (and ${T_w}$ ) is $a_{03}$ 、 $a_{13}$ 、 $a_{23}$ （And $a_{33}$ ）

And finally we get the best choice as:

0x0000	0x0001	0x0002	0x0003	0x0004	0x0005	0x0006	0x0007	0x0008	0x0009	0x000A	0x000B	0x000C	0x000D	0x000E
m[0][0]	m[0][1]	m[0][2]	m[0][3]	m[1][0]	m[1][1]	m[1][2]	m[1][3]	m[2][0]	m[2][1]	m[2][2]	m[2][3]	m[3][0]	m[3][1]	m[3][2]
$a_{00}$	$a_{01}$	$a_{02}$	$a_{03}$	$a_{10}$	$a_{11}$	$a_{12}$	$a_{13}$	$a_{20}$	$a_{21}$	$a_{22}$	$a_{23}$	$a_{30}$	$a_{31}$	$a_{32}$

Actually, as you can see here, nothing changed in memory, what changed exactly is our mathematical convention and our mind. And then I feel it’s “meaningless” to distinguish between “row-major” or “column-major” inside C++, and this should be the reason why some people call C++ as a “row-major” language. But still, we could have a more clear point of view after all of these red tapes just before we start to write something.

But in practice, unfortunately, due to the historical reason of OpenGL (see OpenGL FAQ 9.005), it uses 4x4 matrix in Column-major mathematical convention and a memory layout corresponding to Column-major memory layout in C/C++ (as the document said

The translation components occupy the 13th, 14th, and 15th elements of the 16-element matrix

, means m[3][0] to m[3][3] are ${T_x}$ 、 ${T_y}$ 、 ${T_z}$ and ${T_w}$ ). For the sake of different mathematical conventions, glUniformMatrix4fv function has a parameter at the third position to indicate if we want to ask the vendor’s implementation library to flip the input matrix or not. If we choose to use the same mathematical convention and memory layout as OpenGL initially defined then we just set it to GL_FALSE, means no flipping. And if we choose to use 4x4 matrix in Row-major mathematical convention but didn’t change the memory layout in C++, we need to set it to GL_TRUE to flip the input matrix, as same as the situation when we change the memory layout but don’t change the mathematical convention. If you are some messy guy like me who intended not to use DirectXMath and want to achieve a unified math library among different graphics APIs, you’d better have a caution about it.

Actually it is really a confusing topic, because how we read how we count how we use and understand the terms finally influence our implementation, in my game engine I decided to support both two different conventions and until now it works fine, but in large scale products I assume that would become a problem if people didn’t make the clarification well and clearly at the beginning.

Code samples

Some code samples here:

//Column-major memory layout
#if defined (USE_COLUMN_MAJOR_MEMORY_LAYOUT)
	template<class T>
	auto TVec4::toTranslationMatrix(const TVec4<T>& rhs) -> TMat4<T>
	{
		// @TODO: replace with SIMD impl
		TMat4<T> l_m;

		l_m.m[0][0] = one<T>;
		l_m.m[1][1] = one<T>;
		l_m.m[2][2] = one<T>;
		l_m.m[3][0] = rhs.x;
		l_m.m[3][1] = rhs.y;
		l_m.m[3][2] = rhs.z;
		l_m.m[3][3] = one<T>;

		return l_m;
	}
	//Row-major memory layout
#elif defined (USE_ROW_MAJOR_MEMORY_LAYOUT)
	template<class T>
	auto toTranslationMatrix(const TVec4<T>& rhs) -> TMat4<T>
	{
		// @TODO: replace with SIMD impl
		TMat4<T> l_m;

		l_m.m[0][0] = one<T>;
		l_m.m[0][3] = rhs.x;
		l_m.m[1][1] = one<T>;
		l_m.m[1][3] = rhs.y;
		l_m.m[2][2] = one<T>;
		l_m.m[2][3] = rhs.z;
		l_m.m[3][3] = one<T>;

		return l_m;
	}
#endif

//Column-major memory layout
#if defined (USE_COLUMN_MAJOR_MEMORY_LAYOUT)
	template<class T>
	auto toRotationMatrix(const TVec4<T>& rhs) -> TMat4<T>
	{
		// @TODO: replace with SIMD impl
		TMat4<T> l_m;

		l_m.m[0][0] = (one<T> -two<T> * rhs.y * rhs.y - two<T> * rhs.z * rhs.z);
		l_m.m[0][1] = (two<T> * rhs.x * y + two<T> * rhs.z * rhs.w);
		l_m.m[0][2] = (two<T> * rhs.x * rhs.z - two<T> * rhs.y * rhs.w);
		l_m.m[0][3] = (T());

		l_m.m[1][0] = (two<T> * rhs.x * rhs.y - two<T> * rhs.z * rhs.w);
		l_m.m[1][1] = (one<T> -two<T> * rhs.x * rhs.x - two<T> * rhs.z * rhs.z);
		l_m.m[1][2] = (two<T> * rhs.y * rhs.z + two<T> * rhs.x * rhs.w);
		l_m.m[1][3] = (T());

		l_m.m[2][0] = (two<T> * rhs.x * rhs.z + two<T> * rhs.y * rhs.w);
		l_m.m[2][1] = (two<T> * rhs.y * rhs.z - two<T> * rhs.x * rhs.w);
		l_m.m[2][2] = (one<T> -two<T> * rhs.x * rhs.x - two<T> * rhs.y * rhs.y);
		l_m.m[2][3] = (T());

		l_m.m[3][0] = (T());
		l_m.m[3][1] = (T());
		l_m.m[3][2] = (T());
		l_m.m[3][3] = (one<T>);

		return l_m;
	}
	//Row-major memory layout
#elif defined ( USE_ROW_MAJOR_MEMORY_LAYOUT)
	template<class T>
	auto toRotationMatrix(const TVec4<T>& rhs) -> TMat4<T>
	{
		// @TODO: replace with SIMD impl
		TMat4<T> l_m;

		l_m.m[0][0] = (one<T> -two<T> *  rhs.y *  rhs.y - two<T> *  rhs.z *  rhs.z);
		l_m.m[0][1] = (two<T> *  rhs.x *  rhs.y - two<T> *  rhs.z *  rhs.w);
		l_m.m[0][2] = (two<T> *  rhs.x *  rhs.z + two<T> *  rhs.y *  rhs.w);
		l_m.m[0][3] = (T());

		l_m.m[1][0] = (two<T> *  rhs.x *  rhs.y + two<T> *  rhs.z *  rhs.w);
		l_m.m[1][1] = (one<T> -two<T> *  rhs.x *  rhs.x - two<T> *  rhs.z *  rhs.z);
		l_m.m[1][2] = (two<T> *  rhs.y *  rhs.z - two<T> *  rhs.x *  rhs.w);
		l_m.m[1][3] = (T());

		l_m.m[2][0] = (two<T> *  rhs.x *  rhs.z - two<T> *  rhs.y *  rhs.w);
		l_m.m[2][1] = (two<T> *  rhs.y *  rhs.z + two<T> *  rhs.x *  rhs.w);
		l_m.m[2][2] = (one<T> -two<T> *  rhs.x *  rhs.x - two<T> *  rhs.y *  rhs.y);
		l_m.m[2][3] = (T());

		l_m.m[3][0] = (T());
		l_m.m[3][1] = (T());
		l_m.m[3][2] = (T());
		l_m.m[3][3] = one<T>;

		return l_m;
	}
#endif

Further more

For the multiplication algorithm of matrix, the naive implementation is $O(n^3)$ , and there are lots of optimized algorithms (e.g. Solvay Strassen algorithm $O(n^{2.807})$ /Coppersmith-Winograd algorithm $O(n^{2.3737})$ /Stochers-Williams algorithm $O(n^{2.3729})$ could bring a better time complexity around $O(n^{2.4})$ .

For the transpose algorithm, we could consider about in-situ matrix transposition.

And for the inverse algorithm, Gaussian Elimination is $O(n^3)$ , we could choose LU decomposition, and also there are lots of optimized algorithms which I don’t have time to investigate too much.

Hope one day I’ll finally sit down and optimize my super slow matrix inverse operation!

Published Mar 24, 2018

Random randomness in randomized randomization.Hang Zhang on Twitter