Accelerated randomized benchmarking

Christopher Granade; Christopher Ferrie; D G Cory

doi:10.1088/1367-2630/17/1/013042

Quantum information processing devices offer great promise in a variety of different fields, including chemistry and material science, data analysis and machine learning [1–4], as well as cryptography [5]. Over the past few years, proposals have been advanced for quantum information processing past the classical scale, based on node-based architectures [6, 7]. In addition, rapid progress has been made towards experimental implementations that might allow for developing such devices [8, 9]. An impediment in this effort, however, is presented by the difficulty of calibrating and diagnosing quantum devices.

In particular, in the development of quantum information processing, an important experimental challenge is to efficiently characterize the quality with which we can control a quantum system. By characterizing the quality of a quantum gate that is implemented by a control pulse, we can then reason about the utility of that gate for quantum information processing tasks. For instance, we can estimate the feasibility of and the resources required to implement error correction using that control by comparing to proven and numerically estimated fault-tolerance thresholds [10, 11]. Alternately, we can adjust our control sequences to account for differences between our control model and the actual system.

In cases where only the quality of a quantum gate or set of gates is required, randomized benchmarking has proven to be a useful means of extracting this information with relatively little experimental effort [12]. This has been demonstrated in a variety of experimental settings [9, 13–19]. Randomized benchmarking has also been used to improve gate fidelities by characterizing cross-talk [20] or distortions [21]. Extracting fidelity information can often be useful in diagnosing performance and problems with a device in lieu of full characterization [22]. Moreover, randomized benchmarking has also been used to extract information about the completely positive and unital parts of linear maps [23].

Here, using near-optimal data processing together with prior information, we accelerate the data processing used in benchmarking experiments, such that to achieve the accuracy demanded of benchmarking protocols, we require orders of magnitude less experimental data. We also extend results on the achievable estimation quality in the presence of finite sampling [24] and prior information, then show that our accelerated methods are nearly optimal. Our data processing methods also provide estimates of their own performance, such that our approach thus enables randomized benchmarking to be used where data collection costs make existing benchmarking protocols impractical. Thus, our work complements recent results on the robustness of randomized benchmarking [25] to provide an experimentally useful tool⁷ .

Randomized benchmarking has been recently used to adaptively calibrate control designed by optimal control theory methods such as GRAPE [26], allowing for differences between the control model and the actual system to be adjusted for in experimental practice [27]. These methods are applied in a control design and calibration step, however, and do not allow for control for to be recalibrated dynamically. Whereas randomized benchmarking is performed at the inner-loop of current control calibration algorithms [22], any data collection overhead in benchmarking becomes a very significant cost to control calibration as a whole. Thus, by reducing the data requirements using both better fitting methods and strong prior information, we can enable new applications, such as extending control calibration to an online context.

Here, we show that by using prior information together with the sequential Monte Carlo (SMC) parameter estimation algorithm, we can obtain very accurate estimates of parameters. Moreover, we do so even in the limit of one bit of data per sequence length. That is, we can use a variety of sequence lengths to probe the performance of our gate set rather than repeat many experiments at a given sequence length. On the other hand, we also show that for gates with fidelities near unity, increasing the length of benchmarking sequences offers little compared to repeating experiments at already optimal sequence lengths. The SMC algorithm is based on Bayesian methods, which have been used successfully in a variety of quantum information processing tasks [28–34]. SMC has recently been used in quantum information to learn states [30] and Hamiltonians [35, 36], and to provide robust error bounds on inferred parameters [37]. The primary cost incurred by the SMC algorithm is that the data must be simulated repeatedly, though this can be mitigated by using quantum resources [38–40]. Here we show that since the symmetries afforded by random benchmarking experiments can be used to simulate datasets with costs that are constant with respect to the dimension of the Hilbert space of interest [12], SMC can be implemented with little overhead. Thus, randomized benchmarking mitigates the primary disadvantage of SMC by removing the need to simulate the quantum system.

Moreover, the method of hyperparameters [35] generalizes our approach to allow gate fidelities to be non-trivial functions of some other parameter of interest, such that the underlying parameter is learned directly. This approach is especially relevant if, for example, the effect of the unknown hyperparameter depends on an experimental choice, such that distinct benchmarking experiments can be used in concert in a straightforward way.

Our work proceeds first by defining the benchmarking model that we use, then showing bounds on the estimation of the parameters of this model using the Cramer–Rao bound. We then apply SMC to the benchmarking model and compare to the performance of traditional methods, and to the optimal performance achievable with prior information, showing that our method offers distinct advantages, and is nearly optimal.

1. Interpretation of likelihood as marginalization

Randomized benchmarking consists of using a sequence of random gates to effectively average the action of an error channel such that it can be simulated using simple classical models. If the gates in each randomized benchmarking sequence are chosen uniformly at random from the Clifford group, then the argument of Magesan et al [12] shows that the average fidelity F_g taken over all randomized benchmarking sequences of a given length can be expressed in terms of the survival probability

$\begin{eqnarray}&&{\rm Pr} ({\rm survival}|\psi ,{{{\boldsymbol{i}} }_{m}})={\rm Tr}[{{E}_{\psi }}{{\widehat{{\hat{S}}}}_{{{{\boldsymbol{i}} }_{m}}}}({{\rho }_{\psi }})],\end{eqnarray} \tag{ 1 }$

where ${{E}_{\psi }}$ is a measurement operator corresponding to a fiducial state ${{\rho }_{\psi }}$ , and where ${{\widehat{{\hat{S}}}}_{{{{\boldsymbol{i}} }_{m}}}}={{\widehat{{\hat{S}}}}_{{{i}_{m}}}}\circ \cdots \circ {{\widehat{{\hat{S}}}}_{{{i}_{1}}}}$ is the superoperator representing the sequence ${{{\boldsymbol{i}} }_{m}}$ . Because the Clifford group $\mathcal{C}$ forms a unitary two-design, random sequences of Clifford gates average the errors in each gate over the Haar measure, an operation known as twirling. In particular, given a channel Λ, conjugating the action of that channel by ideal Clifford gates chosen uniformly at random implements the twirling superchannel [12],

$\begin{eqnarray}\begin{array}{rcl} W[\Lambda ](\rho ) & = & \int {\rm d}U\;{{U}^{\dagger }}\Lambda [U\rho {{U}^{\dagger }}]U \\ {} & = & \frac{1}{|\mathcal{C}|}\mathop{\sum }\limits_{U\in \mathcal{C}}\;{{U}^{\dagger }}\Lambda [U\rho {{U}^{\dagger }}]U=p\rho +(1-p)\frac{{\mathbb{1}}}{d}, \\ \end{array}\end{eqnarray} \tag{ 2 }$

where d is the dimension of the Hilbert space on which each gate acts, and where p is related to the average gate fidelity F of Λ by $p=(dF-1)/(d-1)$ .

The expectation value of this survival probability over all sequences of a given length m was shown to produce the uniform-average fidelity

$\begin{eqnarray}&&{{F}_{g}}(m,\psi )={{\mathbb{E}}_{{{{\boldsymbol{i}} }_{m}}|m}}[{\rm Pr} ({\rm survival}|\psi ,{{{\boldsymbol{i}} }_{m}})]={\rm Pr} ({\rm survival}|\psi ,m).\end{eqnarray} \tag{ 3 }$

We may thus interpret the fidelity averaged over a unitary design as a probability of survival in an experiment in which we do not know the sequence being performed. As discussed in detail in appendix, if sequences are fairly drawn from the two-design independently of all other experimental choices, then this is a valid assumption, such that the marginalized survival probability can be taken as the likelihood for our randomized benchmarking model. Note that in the remainder of the paper, we will let ψ be fixed, and will drop the notation conditioning on this assumption.

Using the expansion of the marginalized survival F_g(m) given by Magesan et al [12], we can rewrite the likelihood in a way that explicitly depends on the parameters of interest, and that no longer requires simulating the quantum dynamics of the system. Thus, we can use Bayesian methods without simulating the system under study. In [12], the authors studied a sequence of models of increasing complexity, where the lowest complexity model has found the most use. In particular, we consider the Magesan et al model [12], which they call the zeroth-order model for randomized benchmarking

$\begin{eqnarray}&&{{F}_{g}}(m)={{A}_{0}}{{p}^{m}}+{{B}_{0}}\end{eqnarray} \tag{ 4 }$

for parameters A₀, B₀ which encode errors in preparation and measurement, and p. These parameters are given formally by

$\begin{eqnarray}&&{{A}_{0}}:={\rm Tr}\left[ {{E}_{\psi }}\Lambda \left( {{\rho }_{\psi }}-\frac{{\mathbb{1}}}{d} \right) \right]\end{eqnarray} \tag{ 5a }$

$\begin{eqnarray}&&{{B}_{0}}:={\rm Tr}\left[ {{E}_{\psi }}\Lambda \left( \frac{{\mathbb{1}}}{d} \right) \right]\end{eqnarray} \tag{ 5b }$

$\begin{eqnarray}&&p:=(d{{F}_{{\rm ave}}}-1)/(d-1),\end{eqnarray} \tag{ 5c }$

where d is the dimension of the Hilbert space. Above, ${{F}_{{\rm ave}}}$ is the fidelity of the average channel $\Lambda ={{\mathbb{E}}_{i,j}}[{{\Lambda }_{i,j}}]$ , taken over time steps i and elements of the gate set j. By these definitions, for ideal preparation, evolution and measurement, ${{A}_{0}}=1-\frac{1}{d}$ and ${{B}_{0}}=\frac{1}{d}$ . Since we will often use the example of a qubit, we thus have that the ideal ${{A}_{0}}={{B}_{0}}=1/2$ . A sketch of the derivation of this model is given in figure 1. The interpretation of first- and higher-order models follows in a similar manner. Since we use the zeroth-order model as an example in this work, we will drop the subscript-0 for brevity.

**Figure 1.** Sketch of Magesan *et al* derivation of the zeroth-order model [12]. (a) Sequence of length m = 3 Clifford operations. (b) Change of variables to V_i, factoring out previous gates ${{U}_{i-1}}$ , and with the base case ${{U}_{1}}={{V}_{1}}$ . The V gates then form a two-design. (c) Expectation value over random gates in (a) and (b), giving the twirling superchannel W acting on Λ.
Download figure:
Standard image High-resolution image

**Figure 1.** Sketch of Magesan *et al* derivation of the zeroth-order model [12]. (a) Sequence of length m = 3 Clifford operations. (b) Change of variables to V_i, factoring out previous gates ${{U}_{i-1}}$ , and with the base case ${{U}_{1}}={{V}_{1}}$ . The V gates then form a two-design. (c) Expectation value over random gates in (a) and (b), giving the twirling superchannel W acting on Λ.
Download figure:
Standard image High-resolution image

Because the fidelity of a channel is invariant under Clifford twirling, the parameter p represents the strength of the depolarizing channel of fidelity ${{F}_{{\rm ave}}}$ produced by twirling the average channel Λ, and can be used to recover ${{F}_{{\rm ave}}}$ . Similarly, in the interleaved protocol [41], we consider two probabilities, ${{p}_{{\rm ref}}}$ and ${{p}_{{\bar{\mathcal{C}}}}}$ , respectively representing the sequences with m random Clifford gates multiplied together, or interleaved with some gate $\mathcal{C}$ under study. From these probabilities, we can extract the referenced probability of gate error $\tilde{p}:={{p}_{{\rm ref}}}/{{p}_{{\bar{\mathcal{C}}}}}$ . Each of ${{p}_{{\rm ref}}}$ and ${{p}_{{\bar{\mathcal{C}}}}}$ is traditionally extracted from a fit to the zeroth- or first-order model⁸ .

2. Achievable accuracy

We now consider only the interleaved model since it is more general. For brevity, we represent the model by a vector ${\boldsymbol{x}} =(\tilde{p},{{p}_{{\rm ref}}},A,B)$ , so that the likelihood function for the interleaved model is

$\begin{eqnarray}{\rm Pr} (1|{\boldsymbol{x}} ;m,{\rm mode})=\left\{ \begin{array}{ccccccccccccccc} Ap_{{\rm ref}}^{m}+B & {\rm mode}\;{\rm is}\;{\rm reference}, \\ A{{({{p}_{{\rm ref}}}\tilde{p})}^{m}}+B & {\rm mode}\;{\rm is}\;{\rm interleaved}, \\ \end{array} \right.\end{eqnarray} \tag{ 6 }$

where we have labeled the survival event by '1' to more easily allow for using binomial distributions to consider sums over multiple measurements of the same sequence length.

Having defined our model, it is critical to account for the accuracy with which we can estimate the parameters using finite data records. Here, we extend the results of Epstein et al [24] by explicitly calculating the Fisher information of ${\rm Pr} (1|{\boldsymbol{x}} )$ . We can find a bound on the achievable estimation error in this model by appealing to the Cramér–Rao bound [42], which states that the Fisher information matrix ${\boldsymbol{I}} ({\boldsymbol{x}} )$ bounds the error matrix ${\boldsymbol{E}} ({\boldsymbol{x}} )$ of any unbiased estimator ${\boldsymbol{\hat{x}}}$ by the inequality

$\begin{eqnarray}&&{\boldsymbol{E}} ({\boldsymbol{x}} )\;:=\;{{\mathbb{E}}_{D|{\boldsymbol{x}} }}[({\boldsymbol{\hat{x}}} (D)-{\boldsymbol{x}} ){{({\boldsymbol{\hat{x}}} (D)-{\boldsymbol{x}} )}^{{\rm T}}}]\geqslant {\boldsymbol{I}} {{({\boldsymbol{x}} )}^{-1}}.\end{eqnarray} \tag{ 7 }$

The Fisher information matrix for a single two- outcome measurement is generically a rank-1 matrix, and thus cannot be inverted for models with more than a single model parameter. Thus, if the Fisher information matrix is singular, as is the case here when all of the measured sequences are of the same length, the inverse is taken to be the Moore–Penrose pseudo-inverse. Since the rank of the Fisher information for multiple two-outcome measurements is limited to be at most the number of distinct measurements performed, with at least four different sequence lengths we can break the degeneracy. Since this number depends on the dimension of the model and not the underlying Hilbert space, only four measurements are required to break the degeneracy, even for systems of higher dimension than qubits.

It is often the case that we are only interested in $\tilde{p}$ , the survival probability, and hence the gate fidelity, of a particular gate [13]. In this case, we can bound the error of only that parameter by looking at a single element of the error and Fisher information matrix as

$\begin{eqnarray}&&{\boldsymbol{E}} {{({\boldsymbol{x}} )}_{\tilde{p},\tilde{p}}}\geqslant 1/{\boldsymbol{I}} {{({\boldsymbol{x}} )}_{\tilde{p},\tilde{p}}}.\end{eqnarray} \tag{ 8 }$

To find the Fisher information for randomized benchmarking, we derive the Fisher score ${\boldsymbol{q}}$ of this model, conditioned on 1

$\begin{eqnarray}\begin{array}{rcl} {\boldsymbol{q}} ({\boldsymbol{x}} |1;m,{\rm mode}) & = & {{\nabla }_{{\boldsymbol{x}} }}{\rm log} {\rm Pr} (1|{\boldsymbol{x}} ;m,{\rm mode}) \\ {} & = & {\rm Pr} {{(1|{\boldsymbol{x}} ;m,{\rm mode})}^{-1}}\left\{ \begin{array}{ccccccccccccccc} (0,\;Amp_{{\rm ref}}^{m-1},\;p_{{\rm ref}}^{m},\;1) & {\rm reference}, \\ (Am{{\tilde{p}}^{m-1}}p_{{\rm ref}}^{m},\;Am{{\tilde{p}}^{m}}p_{{\rm ref}}^{m-1},\;p_{{\rm ref}}^{m},\;1) & {\rm interleaved}, \\ \end{array} \right. \\ \end{array}\end{eqnarray} \tag{ 9 }$

where the similar expression for the outcome '0' follows immediately. With this, we can calculate the Fisher information matrix ${\boldsymbol{I}} ({\boldsymbol{x}} ):={{\mathbb{E}}_{D|{\boldsymbol{x}} }}[{\boldsymbol{q}} ({\boldsymbol{x}} |D){\boldsymbol{q}} {{({\boldsymbol{x}} |D)}^{{\rm T}}}]$ , where D labels the outcomes.

Fisher information analysis is one of the most powerful tools of statistical analysis since it bounds the performance of the continuous infinity of possible estimators we could choose. However, given the difficulty of analytically computing the inverse of sums over Fisher information matrices of this form, we use numerical methods for its evaluation. In particular, QInfer [43] performs this calculation automatically, given an implementation of (9).

In experimentally relevant regimes, the task is to gain further accuracy when it is known a priori that the fidelity is high. To minimize the error in estimating $\tilde{p}$ , we maximize the corresponding element of the Fisher information matrix. Note that, as is shown in figure 2, this optimum depends strongly on the value of A and B when $\tilde{p},{{p}_{{\rm ref}}}\approx 1$ . Critically, because randomized benchmarking requires no explicit simulation of quantum systems, it can in principle be used even in very large systems, beyond what can be studied using techniques that depend on simulation. Thus, we also consider the limit as $d\to \infty$ , where for ideal measurements and unital channels, we have $B\to 0$ and $A\to 1$ . In this limit, m can be explicitly optimized such that for large systems

$\begin{eqnarray}&&\mathop{{\rm lim} }\limits_{d\to \infty }{{m}_{{\rm opt}}}\approx \frac{1}{1-\tilde{F}{{F}_{{\rm ref}}}},\end{eqnarray} \tag{ 10 }$

where ${{m}_{{\rm opt}}}$ represents the optimal sequence length. As shown in figure 2, this can grow large for $|1-F|\to 0$ , but even for fidelities near thresholds, such as $|1-F|\approx {{10}^{-3}}$ as considered by [9], ${{m}_{{\rm opt}}}$ remains manageable at about 800. This establishes that the sequence length does not grow large too quickly, providing further evidence of the utility of benchmarking even for experiments beyond the scope of tomographic methods.

**Figure 2.** Optimal value of m as a function of the scale A and offset B parameters, with $\tilde{p}=0.9988$ and ${{p}_{{\rm ref}}}=0.9978$ , based on the example of [9]. On the top left, B = 0.5. On the top right, A = 0.25. Below, we take the limit as $d\to \infty$ of $\lceil {{m}_{{\rm opt}}}\rceil$ , assuming $\tilde{F}={{F}_{{\rm ref}}}=F.$
Download figure:
Standard image High-resolution image

The above calculation is relevant in scenarios where the parameters not of interest (that is, A and B) are known fairly well and the gate fidelity is already known to be near unity. If we have prior information that is not of this form, Bayesian analysis is better suited to the task.

The Bayesian analogue of Fisher information analysis is a straightforward generalization. We begin with a distribution $\pi ({\boldsymbol{x}} )$ , called a prior, over the parameters. Ideally, this is a faithful encoding of the the experimenterʼs prior information, but the following analysis works equally well for any distribution. In particular, given a prior distribution $\pi ({\boldsymbol{x}} )$ , the Bayesian information matrix ${\boldsymbol{J}}$ is then defined as [44]

$\begin{eqnarray}&&{\boldsymbol{J}} :={{\mathbb{E}}_{{\boldsymbol{x}} \sim \pi }}[{\boldsymbol{I}} ({\boldsymbol{x}} )].\end{eqnarray} \tag{ 11 }$

To calculate this we can perform a Monte Carlo integral over the prior by drawing samples ${\boldsymbol{x}} \sim \pi$ and evaluating ${\boldsymbol{I}}$ at each ${\boldsymbol{x}}$ .

The Bayesian Cramer–Rao bound (BCRB) then states that the error matrix ${\boldsymbol{E}} :={{\mathbb{E}}_{{\boldsymbol{x}} ,D}}[({\boldsymbol{\hat{x}}} (D)-{\boldsymbol{x}} ){{({\boldsymbol{\hat{x}}} (D)-{\boldsymbol{x}} )}^{{\rm T}}}]$ , also called the risk, of any estimator ${\boldsymbol{\hat{x}}}$ satisfies

$\begin{eqnarray}&&{\boldsymbol{E}} \geqslant {{{\boldsymbol{J}} }^{-1}}.\end{eqnarray} \tag{ 12 }$

The calculation of the BCRB is naturally included into the SMC algorithm, such that our approach bounds its own performance based on the best experimental data available. Moreover, contrary to the Cramer–Rao bound in equation (7), it is known that the mean of the posterior distribution minimizes the error [45]. Thus, we need not seek the optimal estimator, as it naturally arises from a representation of the posterior.

3. Numerical examples

In the numerical examples we consider here, we choose π to be a normal distribution with a mean vector $(\tilde{p},{{p}_{{\rm ref}}},A,B)=(0.95,0.95,0.3,0.5)$ and equal diagonal covariances given by a deviation of $\sigma =0.01$ . The least-squares fit (LSF) estimator is seeded with an initial guess drawn from this prior, so as to fairly compare the estimators. This distribution is intersected with the hard constraints implied by definitions of the parameters, which defines the support of the prior as

$\begin{eqnarray}\begin{array}{rcl} {\rm supp}\;\pi & = & \{(A,B,p)\;:-(1-1/d)\leqslant A\leqslant 1,0\leqslant B\leqslant 1, \\ {} & {} & 0\leqslant p\leqslant 1,0\leqslant Ap+B\leqslant 1\}. \\ \end{array}\end{eqnarray} \tag{ 13 }$

This distribution was chosen as the likelihood model is less degenerate given these constraints, such that it is easier to reason about bounds for approximately unimodal estimation strategies. Our choice of prior is not critical, however, as we will show later that our algorithm recovers well from the case in which we choose a 'bad' prior.

To demonstrate the Bayesian approach, we compare the standard LSF performance to the SMC algorithm [35], which computes estimates by updating the probability of each of a finite list of hypotheses according to Bayes' rule. In the case of randomized benchmarking, this consists of computing (6) for each hypothesis after each batch of measurements. We note that the cost of computing (6) is independent of the dimension of the system, such that randomized benchmarking explicitly avoids simulating quantum evolution with classical resources.

There are essentially two experimental design choices an experimenter can make: the length of the sequence m, and the number of repetitions K. In the first comparison, we fix the sequence length and vary K. In particular, we take all sequence lengths up to 100 for the reference signal and 50 for the interleaved signal. For each such K, we plot the mean squared error for the SMC and LSF estimators, along with the posterior variance, which provides an online estimate of the performance of SMC, and the Bayesian Cramer–Rao bound. The results, shown in figure 3, demonstrate that SMC can be used to obtain useful estimates of $\tilde{p}$ with a few orders of magnitude less data than is used by least-squares fitting. Moreover, this advantage becomes more pronounced as the number of shots per sequence length approaches one, such that SMC is especially useful in cases where data collection is expensive. We note that this advantage reflects both the performance of SMC itself, and the ability of SMC to take advantage of prior information: for small amounts of data, the LSF estimator chooses estimates far from the initial guess drawn from the prior distribution, while the SMC estimate instead refines the prior. Moreover, SMC can accurately characterize its own performance and can obtain significantly closer accuracy to the ultimate bound given by the BCRB. These advantages are similar to other cases in which SMC shows a large advantage over traditional fitting methods for handling data that is far from deterministic [35, 46].

**Figure 3.** Comparison of the mean squared error averaged over trials (risk) achieved by sequential Monte Carlo (SMC) and least-squares fit (LSF) estimators of ${\boldsymbol{x}} =(\tilde{p},{{p}_{{\rm ref}}},A,B)$ , averaged over 100 trials and varied over the number of sequences per length K (top) and the number of the maximum sequences ${{m}_{{\rm max} }}$ (bottom). The Bayesian Cramer–Rao bound and posterior variance (an estimate of SMCʼs performance) are also shown. On top, the simulated reference signal was taken with sequence lengths $\{1,2,\ldots ,100\}$ and the interleaved signal was taken with $m\in \{1,2,\ldots ,50\}$ . On the bottom, $K={{10}^{3}}$ samples were simulated per sequence length. For each ${{m}_{{\rm max} }}$ , $m\in \{1,11,21,\ldots ,{{m}_{{\rm max} }}\}$ was chosen.
Download figure:
Standard image High-resolution image

In figure 3 (bottom), we show the performance of SMC and LSF when the sequence lengths m vary and the number of shots K per sequence length is fixed, demonstrating that SMC can improve upon LSF especially for very short sequences. Moreover, we see the benefit from increasing the sequence length is minimal compared to repeating experiments at a given sequence length near the optimum length found from the Cramer–Rao bound .

**Figure 4.** Example of randomized benchmarking signals collected using physical gate models. Individual points correspond to collected data with K = 1000 sequences per sequence length, while the curves show the true model as obtained using the zeroth-order model definitions.
Download figure:
Standard image High-resolution image

4. Benchmarking with simulated gates

Thus far in the analysis, we have used as a simulator the same zeroth-order model as is used to process and interpret the data. To demonstrate the utility of our approach in comparison with traditional LSF-based benchmarking, we now simulate gates according to a cumulant expansion, with physically realistic models. In particular, we use the superconducting model of Puzzuoli et al [47] together with optimal control theory [26] to generate a set of gates implementing the target unitaries $\{{\mathbb{1}},X,Y,Z,H,P\}$ , where H is the Hadamard gate, and where $P=|0\rangle \langle 0|+{\rm i}|1\rangle \langle 1|$ is the phase gate and $X,Y$ and Z are the Pauli matrices. We then use the superoperators ${{\widehat{{\hat{S}}}}_{U}}$ for implementing each target unitary U obtained from a cumulant simulation [48, 49] to sample from the likelihood function (1)⁹ . An example signal is shown in figure 4.

To process these samples, we then use the zeroth-order likelihood function (6) both as a model for SMC and as a trial function for least-squares fitting. Since the actual implemented gates are known, we can compute the true parameters for comparison. In table 1, we show the true parameters, the result obtained using SMC, and the result obtained using least-squares fitting. The most important thing to note is that correct parameters are a distance 6.90 σ from the prior (meaning the true parameters are outside of the 99.999 9998% credible ellipse). This shows that even in the case when the prior information fails to accurately capture the uncertainty in the true model, SMC still can perform well, providing evidence that our accelerated methods may also be robust, even when used to measure the fidelities of sets of gates with errors that are correlated between distinct gate types, or that include non-trivial unitary components¹⁰ . We show this in more detail in figure 5, comparing the posterior and prior distributions over $\tilde{p}$ to the true and LSF-estimated values.

**Figure 5.** (Left) Comparison of prior distribution, SMC-approximated posterior, true value and LSF-estimate for $\tilde{p}$ for a single run with K = 1000 shots at each of ${{m}_{{\rm ref}}}\in \{1,11,...,191\}$ and ${{m}_{C}}\in \{2,12,...,192\}$ . An intentionally inaccurate prior is used, such that the true value is approximately 6.9 standard deviations from the mean of the prior. As shown in table 1, SMC does well by comparison to LSF, even with the poorly chosen prior. (Right) Data gathered from simulation with physical-model gates.
Download figure:
Standard image High-resolution image

**Figure 5.** (Left) Comparison of prior distribution, SMC-approximated posterior, true value and LSF-estimate for $\tilde{p}$ for a single run with K = 1000 shots at each of ${{m}_{{\rm ref}}}\in \{1,11,...,191\}$ and ${{m}_{C}}\in \{2,12,...,192\}$ . An intentionally inaccurate prior is used, such that the true value is approximately 6.9 standard deviations from the mean of the prior. As shown in table 1, SMC does well by comparison to LSF, even with the poorly chosen prior. (Right) Data gathered from simulation with physical-model gates.
Download figure:
Standard image High-resolution image

Table 1. Results of using SMC and least-squares fitting to estimate the fidelity of U = X, simulated using the superconducting qubit gate set. (Left) Bad prior from figure 5, (right) accurate prior from figure 6.

	Bad prior ( $40\times {{10}^{3}}$ bits)				Good prior ( $3\times {{10}^{3}}$ bits)
	$\tilde{p}$	${{p}_{{\rm ref}}}$	A₀	B₀	$\tilde{p}$	${{p}_{{\rm ref}}}$	A₀	B₀
True	0.9983	0.9957	0.3185	0.5012	0.9983	0.9957	0.3185	0.5012
SMC estimate	0.9953	0.9971	0.2639	0.5164	0.9936	0.9976	0.3007	0.5028
LSF estimate	0.9905	0.9989	0.5525	0.2702	0.9917	0.9988	0.5266	0.2718
SMC error	0.0030	0.0014	0.0545	0.0152	0.0048	0.0019	0.0178	0.0016
LSF error	0.0078	0.0032	0.2341	0.2310	0.0066	0.0031	0.2081	0.2294
SMC relative error	0.300%	0.14%	17.12%	3.03%	0.478%	0.19%	5.58%	0.31%
LSF relative error	0.784%	0.32%	73.50%	46.08%	0.664%	0.31%	65.36%	45.78%

Finally, in figure 6, we demonstrate the advantage of our method in the presence of physical gates together with a more reasonable prior, and using approximately ten-fold less data than in figure 5. Taken with other evidence of the robustness of SMC methods [39, 46], these results thus show that our method is useful and provides advantages in data collection costs in experimentally reasonable conditions.

We also note that LSF provides an accurate estimate of $\tilde{p}$ for the simulations with physical gates, but it appears to be at the expense of providing poor estimates for A and B. Given that the errors in $\tilde{p}$ and those in A and B are not in general uncorrelated, that LSF often provides such poor estimates of A and B makes the estimates of $\tilde{p}$ derived from LSF difficult to trust.

In this work, we have discussed the fundamental limits of the randomized benchmarking technique that are incurred due to small data sets, and have shown an algorithm that reliably saturates this optimum. In doing so, we have shown that by using SMC, with a moderate tradeoff in computational costs, one can obtain as much as two orders of magnitude improvement in estimation accuracy, such that data collection requirements are similarly reduced by as much as a 100-fold. Given the wide and expanding use of randomized benchmarking in experimental practice, this then translates to a significant performance benefit both in benchmarking, and in experimental protocols derived from benchmarking.

Acknowledgments

The numerical methods used in this paper are implemented using the QInfer, QuaEC, QuTiP 2 and SciPy libraries for Python [43, 50–52]. We thank Holger Haas for simulating the gate sets used above, and for discussions. We thank Nathan Wiebe, Ben Criger, Ian Hincks, Josh Combes, Joel Wallman, Joe Emerson and Yuval Sanders for useful discussions and insights. CG and DGC acknowledge funding from Industry Canada, CERC, NSERC and the Province of Ontario. CF acknowledges funding from NSF Grant No. PHY-1212445 and NSERC.

: Appendix. Sampling variance and derivation of marginalized likelihood

In this derivation, we will focus on the zeroth-order model of Magesan et al [12], which gives that the average fidelity F_g(m) over all sequences of length m is given by

$\begin{eqnarray}&&{{F}_{g}}(m)={{A}_{0}}{{p}^{m}}+{{B}_{0}}\end{eqnarray} \tag{ A.1 }$

for constants A₀ and B₀ describing the state preparation and measurement errors, and where $1-p$ is the depolarizing strength of $W[{{\mathbb{E}}_{C\sim {{\mathcal{C}}_{n}}}}{{\widehat{{\hat{S}}}}_{C}}]$ .

We are interested in the single-shot limit, where each measurement consists of first selecting a sequence, then measuring once the survival probability for that sequence. Since this protocol makes no use of the sequence other than its length, we can describe the protocol by marginalizing over the choice of sequence, giving a probability distribution of the form ${\rm Pr} ({\rm survival}|m)$ , where m is a sequence length.

To derive this, we first pick a length m, and then consider the choice of sequence ${\boldsymbol{i}}$ out of all length-m sequences to be a random variate. Thus, there exists probabilities

$\begin{eqnarray}&&{{p}_{m,{\boldsymbol{i}} }}:={\rm Pr} ({\rm survival}|{\boldsymbol{i}} ,m)={\rm Tr}({{E}_{\psi }}{{\widehat{{\hat{S}}}}_{{\boldsymbol{i}} }}[{{\rho }_{\psi }}])\end{eqnarray} \tag{ A.2 }$

for each individual sequence that we could have chosen, such that marginalizing over results in

$\begin{eqnarray}&&{\rm Pr} ({\rm survival}|m)={{\mathbb{E}}_{{\boldsymbol{i}} }}[{\rm Pr} ({\rm survival}|{\boldsymbol{i}} ,m)].\end{eqnarray} \tag{ A.3 }$

If each sequence is drawn with uniform probability, then

$\begin{eqnarray}&&{\rm Pr} ({\rm survival}|m)=\frac{1}{|{{\mathcal{C}}_{n}}{{|}^{m}}}\mathop{\sum }\limits_{{\boldsymbol{i}} \,{\rm s}.{\rm t}.\,{\rm len}\,{\boldsymbol{i}} =m}{{p}_{m,{\boldsymbol{i}} }}.\end{eqnarray} \tag{ A.4 }$

We recognize this as being the average sequence fidelity F_g(m) modeled by Magesan

$\begin{eqnarray}&&{\rm Pr} ({\rm survival}|m)={{F}_{g}}(m)={{A}_{0}}{{p}^{m}}+{{B}_{0}}.\end{eqnarray} \tag{ A.5 }$

To interpret F_g(m) as a likelihood directly, note that we had to consider the Bernoulli trial (single-shot) limit; had we instead taken K distinct sequences and measured each $N\gt 1$ times, we would have arrived at a quite different quantity

$\begin{eqnarray}&&{{\hat{F}}_{g}}(m)=\mathop{\sum }\limits_{k=1}^{K}\hat{F}(m,{{{\boldsymbol{i}} }_{k}}),\end{eqnarray} \tag{ A.6 }$

where $\hat{F}(m,{{{\boldsymbol{i}} }_{k}})$ is the estimate of the sequence fidelity for the particular sequence ${{{\boldsymbol{i}} }_{k}}$ .

The difference is made clear by considering an example with fixed sequence length m, and the variance for a datum $d\sim {\rm Pr} ({\rm survival}|m)$ (labeling 'survival' as 1 and the complementary event as 0)

$\begin{eqnarray}&&{{\mathbb{V}}_{d}}[d|m]={{\mathbb{V}}_{{\boldsymbol{i}} }}[{{\mathbb{E}}_{d}}[d|{\boldsymbol{i}} ,m]]+{{\mathbb{E}}_{{\boldsymbol{i}} }}[{{\mathbb{V}}_{d}}[d|{\boldsymbol{i}} ,m]].\end{eqnarray} \tag{ A.7 }$

The second term corresponds to the mean variance over each fixed sequence ${{{\boldsymbol{i}} }_{m}}$ , and governs how well we can estimate each $F(m,{\boldsymbol{i}} )$ individually. The first term, however, is more interesting, in that it measures the variance over sequences of the per-sequence survival probability ${{p}_{m,{\boldsymbol{i}} }}={{\mathbb{E}}_{d}}[d|{\boldsymbol{i}} ,m]$ . By the argument of Wallman and Flammia [25], this is small when the fidelity being estimated is close to 1; that is, when the gates being benchmarked are very good. For gates that are farther from the ideal Clifford operators, however, or for applications such as tomography via benchmarking [23], this term is not negligible, mandating that many different sequences must be taken for ${{\hat{F}}_{g}}(m)$ to be a useful estimate of F_g(m).

By demanding that each individual shot be drawn from an independently chosen sequence, our approach avoids this and samples from $d|m$ directly. In this way, we see a similar effect as in [32]. In particular, it is not advantageous to concentrate oneʼs sampling on one point, but to spread samples out and gain experimental variety. Here, the one shot per sequence limit plays the role of the one sample per time-point limit in the earlier discussion.