Hi Amir,
Thanks for the comment. As a matter of fact, I couldn't fully train the model because of a lack of resources, so I couldn't carry out an apples-to-apples comparison (as I mentioned, it took 100 hours to train a single epoch, at which point I gave up). I should also point out that this is a hybrid transformer, not a fully quantum one. I think the lesson so far is that there are a number of software packages around, especially PennyLane, that make the transition very easy in terms of designing the networks. I wouldn't say the same for computing platforms. Perhaps one could train such a complicated network on AWS Braket, but that isn't free. Perhaps one could get some academic access, but if you work in the private sector, I don't see how this can be pulled off unless you work for a big company.
In any case, I think this is a first step toward a quantum version of the most successful network architecture to date for analyzing sequential data.