Gauge Problem in Automatic Differentiation
- Hello world example
- AD for general eigen decomposition
- AD for complex valued SVD
Automatic differentiation infrastructure has been integrated with every machine learning libraries. However, there are still some missing and wrong implementations of AD formula in terms of involved linear algebra operations such as SVD or eigen decompositions. The case becomes even worse when complex numbers play roles in these linear algebra operations let alone some ML libraries don’t support complex numbers at all.
Although there are many famous technical reports with lists of AD formulas for linear algebra operations, they either focus on real case123 or, as I will show below, often wrong with missing terms or wrong in the proof (though a wrong proof can still lead to correct AD formula in some cases)45.
Therefore, in this post, I will explain how to derive the correct AD formula for complex valued SVD and general eigen decompositions (which is complex by nature). I will also explain how the formulas and proofs in literatures are wrong or imperfect. The failure of previous proofs is rooted in the notorious gauge problem in these linear algebra operators, and thus I will demonstrate how to incorporate gauge freedoms in AD context. The basic knowledge on automatic differentiation, complex calculus6 and linear algebra is required to understand this post.
Hello world example
As every tutorial style post, let’s begin with a hello world example for AD with gauge problem. The forward operation is defined as , where and are all scalars. Note this example is very artificial though it actually corresponds to 1D eigen decomposition.
General framework for AD
In general, both forward mode AD and reverse mode AD formula can be derived from function relations
And theoretically, we believe such relations can always be found, as they describe the change of outputs with the change of inputs which is totally determined by the well-defined function operation. If we find , the forward mode AD formula is just:
Note in stands for the function form defined in instead of the differentials. As for reverse mode AD, we have the full differentiation of loss function as:
This formula is general for matrix and complex numbers. The adjoint is defined as , and is for Hermitian conjugate. Since we have the knowledge of dependence and so on, we could replace all on RHS with , and the term before on RHS is just same as . By this, we can derive reverse mode AD formula as function dependence by comparing coefficients before .
The above is the general framework for finding AD formula for any operators. The core relation is , and we must find such function dependence of differentials. Let’s now see how we can apply the above procedure to the hello world example here (spoilers: you will fail).
To get , we differentiate functions and restrctions as:
One can simply find , but it is obvious one can never find . and seems to decouple, namely has become a free differential that doesn’t depend on the differential of the inputs. If we cannot find from the beginning, how can we get the formula for AD in both forward and reverse mode? Where is the problem?
The problem is that the function primitive is somehow ill-posed, i.e. is not unique given input . Say is one output from , then must also be an legal output from the same input when is a phase. We can verify this by and . There are infinite outputs that satisfying the definition of . We call such extra freedom of choosing outputs as gauge. Functions with gauge freedom make sense as long as the final objective(loss function) is independent of the gauge. For example, we can define loss function here, and such loss function is independent of (the objective is trivial as our hello world example is over simplified, but you get the idea).
To summarize, some function or linear algebra operations cannot uniquely determine the outputs based on inputs, the extra freedom is denoted as gauge. The whole setup still make sense as long as the final objective is independent of the gauge.
Gauge invaraince as a condition
To make the definition of valid, we have another condition here: the loss function must independent of . i.e. .
By taking we can have:
It is worthing emphasizing that is not strong enough to conclude . Instead, gauge invariant loss function only guarantee . Since we know , when , , . Back to 1D case for our hello world example, we have the following condition from gauge invariance:
At first sight, this relation is too weak to utilize. It just states something must be real. But let’s now directly tackle without knowing exactly (which is impossible to uniquely determine due to gauge problem):
In the third line, we have utilized the gauge invariance condition , and in the last line, we have utilized the fact , which gives . Therefore, the reverse mode AD formula is .
Hindsight and gauge fixing approach
After the derivation, we found that indeed has no contribution to . But this is different from . is not true and this is the error various works have made before. Say , which is a legal output according to the definition of , the corresponding is for sure non zero. You may argue that by an appropirate gauge fixing scheme, could indeed be zero and there is no need to worry about gauge problem. It is true for this simple hello world example, but in more complex case like eig or svd, it is nearly impossible to fix gauge in a natural way. And simply taking some free differentials to be zero is not only wrong in the proof but can also be wrong in the final result. In this hello world example, happen to a legal gauge fixing scheme which is not the case for more general settings.
More specifically, one can for sure derive correct AD formula by gauge fixing, and such approach is in general easier especially in terms of forward mode AD as we will discuss in the following. In our context, differentials of function definitions for operators cannot fully determine the relation between differentials of outputs and inputs. There must be some differentials of outputs are free from the function definition constraint due to gauge freedom for the outputs. In our toy example, the free differential is just . But it is not always a correct and safe gauge fixing scheme by simply let such free differentials to be zero. This is because there might be extra restriction conditions on outputs which impose some intrinsic restriction on the free differentials. Such restriction may enforce nonzero nature for free differentials. In other words, it is impossible to make free differentials to be zero no matter which gauge is chosen. That being said, after careful consideration on the restrictions, we could fix the gauge by specify some restriction-consistent form to free differentials. And a correct gauge fixing scheme is in general simpler than doing all stuff in a gauge invariant style.
In retrospect, since there is extra gauge freedom, we have no way to derive the full formula of differentials in form uniquely as itself cannot fully determine . Therefore, we must ask help for the condition from gauge invariant loss functions. By integrating this condition into , we could hopefully get rid of the direct need of .
Forward mode AD
Even after we have derived the reverse mode AD formula for this toy operation in gauge invariant fashion, we still have no idea about which is actually somehow arbitrary due to gauge problem. But we need to obtain formula for forward mode AD. There are two approaches to achieve this. The first one is somehow in a gauge invariant fashion by asking help for revese mode formula . The philosophy is like do reverse mode twice7 to get forward mode. After we have obtained , we have:
By comparing the coefficients before , we have the effective relation . Such relation only hold for some fixed gauge.
The second approach is to directly utilize types of equations by gauge fixing. This approach is more straightforward to obtain possible forward mode AD formula. However, as we mentioned in the last subsection, one should by very careful about gauge fixing. Gauge fixing is by no means just taking missing differentials as zero. The restrictions on ouputs shipped with the function definition can enforce nonzero nature of such differentials.
We follow the first approch here. In our toy example, we have:
By comparing coefficients before , we have . Such effective is not the same thing of directly dropping at the stage of proof for reverse mode AD. Again, this over simplified example is not good enough to show this subtle difference. Here, is given by reversing the reverse AD formula. We now have the forward mode AD for this example as: .
Freedom in AD formula
This is not the end of the story for hello world example. We can actually design many “correct” formulas for reverse mode AD and forward mode AD. For example, the reverse mode AD formula can be written as . Although the form of the formula can be different, the numerical value is the same for . Similarly,we can then obtain the forward mode AD formula as . Suprisingly, different version of forward AD formula give different even in numerical value sense. Forward AD still makes sense with gauge freedom since is the same due to the gague invariance objective , no matter how different of can be.
Summary on the problem structure
We learn about the general strcture of the problem from this toy example. There are three parts determining a well-defined forward operator: function, restriction and gauge. The function defines how outputs depends on the inputs like here. However, such function alone cannot fully determine the outputs in some cases and we need extra restrictions to describe the operation, i.e. here. There could still be some extra freedoms for the outputs after we have included restrictions, these remainning freedoms are thus denoted as gague freedom. In other word, guage freedom and restriction conditions on outputs are two sides of the same coin. All extra freedoms left with function definition should either go to restrictions or gauge freedoms.
In the following, we will discuss two specific AD examples with gauge problem, they are EIG and SVD.
AD for general eigen decomposition
- Input: Matrix A, real or complex valued
- Output: Matrix U, real or complex valued, U could be complex when A is real. Diagonal matrix E, E could also be complex when A is real. And actually, U or E is literally always be complex valued when the dimension of matrix increases.
- Restriction: Every column vector of U (right eigen vector of A) is normalized, i.e. , where denotes elements multiplication.
- Gauge: Based on the above condition, the outputs are not unique. The gauge freedom is diagonal matrix whose elements are all in phase form . If U, E are legal outputs, are also satisfying all the above relations.
Let’s try to get the AD formula following the philosophy of hello world example. The first step is differentiate the function, and find the so-called free differentials, i.e. differentials of some outputs decoupled from function relation.
The differentiation of function is given by:
Multiply on the left, we have:
where . By , we can extract equation from diagonal part as well as off-diagonal part, which are:
where is matrix with all elements one but zero diagonal terms. F is defined with zero diagonal elements and off diagonal terms as:
Since is all information we have on differentials of outputs, term is missing. is the free differential in this problem due to gauge freedom . But we cannot simply set it as zero by gauge fixing arguments, since there is still restrictions in the definition of eig operation. Therefore, the value of must satisy the following restriction on the inner structure of :
Gauge fixing route
Let’s see how impose restrictions on , recall .
Therefore, we cannot safely set zero as is not zero in this gauge in general. But we can fix the gauge safely by simply let
This gauge fixing is safe as it trivially satisy . And we could continue to derive both forward and reverse mode AD formulas, since we now have relations in the form of .
The following is the derivation of backprop formula:
By comparing this with , we have the backprop formula for general eig operation:
Note backprop in that paper is on gradients while is on derivatives. They are different up to complex conjugate. From implementation perspective, autograd backprops derivatives and should utilize exactly while tensorflow backprops gradients and should use the formula in 4.
The forward mode AD is also straightforward:
Gauge invariant route
One can also derive the backprop formula with the knowledge that the objective is gauge invariant as . In this fashion, we can get rid of gauge fixing which is not so elegant. Though I have to admit gauge fixing is the simplest way to derive AD formula especially for forward mode. Obtainning forward mode AD formula in gauge invariant fashion may be very painful.
we conclude that gauge invariant loss function gives:
It should be emphasized that gauge invariance cannot be taken as . In eig case, . Supoose an guage invariant , is obvious nonzero in this case.
Let’s look in detail on . In our case, the diagonal matrix is made from elements like . When , we can easily show that , and thus they are not independent. This is obvious, since each element of is controlled by one real parameter instead of two. Therefore, we have:
Since each row of is independent, the above formula actually indicates:
What if the gauge freedom is enlarged? For example, we can move the normalization condition for U (restrictions) to gauge freedom in eig forward pass settings. In that case, gauge ’s diagonal elements are in the form , namely every right eigenvector can be rescaled as normalization condition is gone. Therefore, and are independent and can vary freely. This indicates that
which is a much stronger condition since the gauge in enlarged. Correspondingly, the objective has to be independent on norm of eigenvectors. For example is gauge invariant when gauge contains only phase, while it is not gauge invariant when norm is also included in gauge.
Now considering the relation in (we only consider part since other parts are trivial and have been shown in the last subsection):
where we have utilized the fact from gauge invariance: is real. According to ,
Plug this into , we can arrive to the same formula as in the guage fixing case.
Operation combination route
There is actually another route towards correct AD formula but free from subtlety of restrctions in the operation definition. In this approach, we consider the eig operation as the combination of two: op1. eig without normalization restriction and thus gauge freedom ; op2. normalization function on each columns of U.
As we have shown in the last subsection, for 1. gauge invariance guarantees that . Therefore, the backprop formula is very simple to obtain for op1 since the most involved part of is canceled as:
From this, we could actually derive the backprop formula in 5. In other words, backprop formula in 5 is only valid when objective is irrelavant with the norm of eigenvectors. Such large gauge strongly suppresses the possibility of loss functions. And proof in 5 is not rigorous by directly dropping terms instead of arguing from gauge invariance like what we have done here. 4 also has the same loophole in the proof for op1.
We can now pipeline op2 following op1, and then consider the derivatives for the combination of operations. There are two approaches: 1. derive the derivatives for the composed operator by chaining the derivatives of individual operators by hand as 4 (see details in 4 P30); 2. utilize AD machinery by smartly defining the operator: include a normalization process in the operator definition. Even though
tf.linalg.eig already returns normalized eigenvectors, we still pipeline this with a normalization function as the forward pass. By this, the backpropagation is automatically correct without any further derivations.
In highsight, the difficuly of AD doesn’t come from gauge but comes from restrictions. The larger the gauge, the smaller the restrictions, the easier to obtain AD formulas, though the apllicable range of loss functions is also smaller. With operation combination idea, we first derive the AD formula for functions with no restrictions on outputs and thus the largest gauge freedoms. And then by applying restriction equivalent transformations, we could obtain the AD formula for the combination of operators as a whole either by analytically derivation or by AD machinery.
Side notes on real input case
After obtaining the correct backprop formula for eigen operation, there are still some issues when the input matrix A is real. The backprop formula gives complex results for real inputs and real loss functions(outputs). How could a function with real inputs/outputs give complex derivatives?
Well, the answer is that there are two different scenarios. 1). complex input/output happens to be real at the same time. For example , the input x and output y can be both real. But the derivative is still , which is a complex number. This is because there is no restriction for x to move toward imaginary axis. This argument applies to eig case. If the input matrix is allowed to vary freely in complex field but happen to be a real one, then it is ok to have complex derivatives. However, 2) if input/output is restricted to real number field then the derivative is the real part of the result from the AD formula .
This fact could also be explained as follows. If A is only allowed to vary in real number field, then , , is reformulated as
Therefore, we now have . This formula demonstrates why we need to take real part of the derivative for the complex case.
So the better understanding on this issue is as follows: the general eigen operation can be viewed as the combination of
A=tf.cast(A, tf.complex) follows by usual
E, U = tf.linalg.eig(A). From this perspective, the input of
eig can be thought as always complex. Then it is fine to have complex derivatives , but such direvatives will be casted to real type when backpropagating through the cast op8. Therefore, whether we should take real part of the derivative results depends on ourselves: it is ultimately connected to whether complex input matrix is allowed.
Operations such as
eig implies that complex number can naturally emerge from real valued input. In broader context, only complex number is natural for general operations. It is somehow silly to restrict all operations to real numbers in deep learning setups and omit the implementation of complex valued tensors in some machine learning libraries. Complex number is everywhere and this reflects an important math fact that the field of complex numbers is the largest algebraically closed field.
AD for complex valued SVD
SVD is defined as , where are both unitary and S is diagonal. It is easy to see there is gague freedom for this operation. Namely, are also legal outputs for SVD as long as is a diagonal matrix with phase elements in the form . Therefore complex valued SVD naturally falls into the category of operation with gauge freedoms. For details on the derivation of backprop formula for SVD, just check our work9. v3 of this preprint gives the correct proof, and it is also interesting to see how the first two versions of proof are problematic.
Many ideas and derivations in this post are credited to my colleague Zhou-Quan Wan. We also thank Jin-Guo Liu and Evgeniy Zheltonozhskiy for relevant discussions10 on AD for SVD and EIG case respectively.
A new trick for calculating Jacobian vector products, also see very informative discussions at https://github.com/HIPS/autograd/pull/175. ↩