AWG Quantum Synthetic Data API

Hello, community! First of all, apologies for the months of absence. We’ve been fully immersed in developing Auren, our platform for early cancer detection :coffee:.

Now, let’s get straight to the point:

Some time ago, we realized we needed better data to train our tabular models. We experimented with GANs, which are powerful, but ran into mode collapse (where the generator gets stuck producing a narrow, low-diversity range of samples) and unstable training. We then tried GENs, which are more stable, but the generated data neither reduced the loss nor improved training accuracy in our Torch models.

Finally, we discovered quantum randomness simulations. These allow us to generate high-quality synthetic data, particularly for omic tabular datasets. We use Cirq to design quantum circuits, as it works well on Metal. However, if you prefer CUDA, Qiskit might be a better option.

The Circuit We Use

Our circuit generates synthetic data through quantum randomness simulations by combining superposition, entanglement and controlled rotations to achieve a structured randomness pattern. Here’s an overview:

0: ───H───@───Rz(0.123π)───M('m0')────────────────

1: ───H───X───@────────────Rx(0.456π)───M('m1')───

2: ───H───────@────────────Ry(0.789π)───M('m2')───

Step-by-Step Breakdown

  1. Initialization of Qubits:

• Each of the three qubits (0, 1, and 2) starts in the |0⟩ state, the default initial state in quantum circuits.

  2. Hadamard Gate (H):

• Applied to each qubit at the start, this places each qubit in an equal superposition of |0⟩ and |1⟩ (a pure state, not a mixed one). This introduces essential randomness into the circuit.

  3. Controlled Gates (@):

• These two-qubit controlled gates introduce entanglement:

• Qubit 0 controls an X on Qubit 1 (a CNOT).

• Qubit 1 controls Qubit 2. The diagram draws @ on both lines, which is how Cirq renders a controlled-Z (CZ) rather than a CNOT.

• This means the qubits’ states become correlated, enabling the generation of complex patterns.

  4. Rotations (Rz, Rx, Ry):

• Each qubit is rotated in a specific way on the Bloch sphere:

• Qubit 0: Rz rotation around the z-axis.

• Qubit 1: Rx rotation around the x-axis.

• Qubit 2: Ry rotation around the y-axis.

• The Rx and Ry rotations shift the final measurement probabilities, adding diversity and complexity. An Rz applied right before a computational-basis measurement changes only the phase, so it does not alter that qubit’s outcome probabilities.

  5. Measurements (M):

• Each qubit is measured in the computational basis (|0⟩ or |1⟩), producing outputs labeled as m0, m1, and m2.
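As a rough check of the steps above, here is a minimal state-vector sketch in plain Python (no Cirq required) that computes the probability of each 3-bit outcome for the circuit as drawn. Several things here are assumptions rather than part of the original post: the helper names (`apply_1q`, `apply_cnot`, `apply_cz`, `outcome_probabilities`) are hypothetical, qubit 0 is treated as the most significant bit, the rotations use the usual exp(-iθ/2 · axis) convention, and the gate between qubits 1 and 2 is read as a CZ because the diagram shows @ on both lines.

```python
import cmath
import math

N = 3  # qubits 0, 1, 2; qubit 0 is the most significant bit of the basis index

def apply_1q(state, gate, q):
    """Apply a 2x2 gate to qubit q of an 8-amplitude state vector."""
    shift = N - 1 - q
    new = [0j] * len(state)
    for i, amp in enumerate(state):
        bit = (i >> shift) & 1
        base = i & ~(1 << shift)
        for out in (0, 1):
            new[base | (out << shift)] += gate[out][bit] * amp
    return new

def apply_cnot(state, control, target):
    """Flip the target bit of every basis state whose control bit is 1."""
    cs, ts = N - 1 - control, N - 1 - target
    new = [0j] * len(state)
    for i, amp in enumerate(state):
        j = i ^ (1 << ts) if (i >> cs) & 1 else i
        new[j] += amp
    return new

def apply_cz(state, a, b):
    """Negate the amplitude of basis states where both qubits are 1."""
    sa, sb = N - 1 - a, N - 1 - b
    return [-amp if ((i >> sa) & 1) and ((i >> sb) & 1) else amp
            for i, amp in enumerate(state)]

S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]

def rz(t):
    return [[cmath.exp(-1j * t / 2), 0], [0, cmath.exp(1j * t / 2)]]

def rx(t):
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[c, -1j * s], [-1j * s, c]]

def ry(t):
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[c, -s], [s, c]]

def outcome_probabilities():
    state = [0j] * (2 ** N)
    state[0] = 1 + 0j                      # all qubits start in |0>
    for q in range(N):
        state = apply_1q(state, H, q)      # Hadamard on every qubit
    state = apply_cnot(state, 0, 1)        # qubit 0 controls X on qubit 1
    state = apply_cz(state, 1, 2)          # assumption: @-@ in the diagram is a CZ
    state = apply_1q(state, rz(0.123 * math.pi), 0)
    state = apply_1q(state, rx(0.456 * math.pi), 1)
    state = apply_1q(state, ry(0.789 * math.pi), 2)
    # Keys are bitstrings "m0 m1 m2" without spaces.
    return {format(i, "03b"): abs(a) ** 2 for i, a in enumerate(state)}

probs = outcome_probabilities()
for bits, p in probs.items():
    print(bits, round(p, 4))
```

If a CNOT from qubit 1 to qubit 2 was intended instead, replacing the `apply_cz(state, 1, 2)` line with `apply_cnot(state, 1, 2)` covers that reading.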

Why This Circuit?

This design combines superposition, entanglement, and rotations so that the measured bits follow a structured, non-uniform random pattern. We use that pattern to generate tabular data with a rich random structure for training machine learning models.

Quantum Synthetic Data API for AWG

We’ve created an API that lets you generate quantum synthetic data without programming. Just upload a CSV file through the Swagger page, execute the request, and you’re done. Here are the details:

URL: https://www.gaiahealthai.com:9000/docs#/

Platform: Deployed on a VM with 8 CPU cores, 8GB RAM, and SSD storage.

Response Time: Each inference takes about 2 minutes :coffee:, processing the CSV and applying quantum randomness to return a new CSV with synthetic data.

Limitations: Each response is capped at 250 rows and 4 columns of the input sample because quantum simulations are computationally demanding. Please be patient while the request runs.
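For anyone preparing a file to fit those limits, here is a minimal sketch using only the Python standard library. It assumes the CSV has a header row and that keeping the first 250 data rows and first 4 columns is acceptable for your test; `trim_csv` is a hypothetical helper, not part of the API, and how the API itself handles larger files is not stated above.

```python
import csv
import io

MAX_ROWS = 250  # row cap stated in the post
MAX_COLS = 4    # column cap stated in the post

def trim_csv(text, max_rows=MAX_ROWS, max_cols=MAX_COLS):
    """Keep the header plus the first max_rows data rows and max_cols columns."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i, row in enumerate(reader):
        if i > max_rows:  # row 0 is assumed to be the header
            break
        writer.writerow(row[:max_cols])
    return out.getvalue()

# Usage with a small in-memory example (6 columns, 300 data rows).
header = ",".join(f"col{c}" for c in range(6))
rows = [",".join(str(r * 10 + c) for c in range(6)) for r in range(300)]
trimmed = trim_csv("\n".join([header] + rows))
print(trimmed.splitlines()[0])  # header restricted to the first 4 columns
```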

If you have any questions, feel free to send us a direct message or email at joan@totemhealthlab.org.

Note: The API is completely free and public for the AWG community. We hope you can use it, experiment with it, and make the most of it. Naturally, the API still has room for improvement, and we’ll share updates and enhancements with each new version.

We would also be delighted to assist with interesting projects. Please don’t hesitate to reach out.

@Laura @juanma @lauren.sanders


Hi Joan and team, thank you so much for developing this synthetic data generator API using quantum computing principles!

I did a quick test using data from OSD-48 and put my results in a Google Colab notebook HERE.

I used PCA and correlation as preliminary measures of similarity between synthetic and original samples. Would you recommend other metrics? It looks like the PCA is picking up a lot of variation between the synthetic samples, and I wonder if they would look more similar if we could input more than 250 genes in the future. Interestingly, the pairwise correlation is also really high between a few samples that aren’t the same. Does the API keep the columns in the same order when generating the synthetic data? I’m just wondering if they potentially got reordered.
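For readers who want to reproduce the pairwise-correlation check without opening the notebook, here is a minimal sketch in plain Python. It is not the notebook's code: `pearson` and `correlation_matrix` are hypothetical helpers, and the sample values are made up purely for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)

def correlation_matrix(samples):
    """Map every (name_a, name_b) pair to its Pearson correlation."""
    names = list(samples)
    return {(a, b): pearson(samples[a], samples[b]) for a in names for b in names}

# Illustrative values only: one original sample and two synthetic ones.
samples = {
    "original": [1.0, 2.0, 3.0, 4.0],
    "synthetic_a": [1.1, 2.1, 2.9, 4.2],
    "synthetic_b": [4.0, 1.0, 3.0, 2.0],
}
matrix = correlation_matrix(samples)
print(round(matrix[("original", "synthetic_a")], 3))
```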

Any thoughts from anyone else? @james.casaletto @aamir @cmschmidt


This is a very computationally expensive option and will not give you the result you expect, because you are doing it through simulation. To achieve the randomness you are looking for, you need to run it on real quantum hardware. In this case, it is merely very costly classical noise. Depending on what you need it for, it could still be useful, but keep in mind that it remains pseudo-randomness.
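To illustrate the point, here is a minimal sketch of how a classical simulator typically draws measurement outcomes: it samples from the computed probabilities with a pseudo-random generator, so the same seed reproduces the same "quantum" bits. The probability table and the `sample_bitstrings` helper are hypothetical and only serve the example.

```python
import random

# Hypothetical outcome probabilities for a 3-qubit measurement (illustrative values only).
PROBS = {"000": 0.30, "011": 0.20, "101": 0.35, "110": 0.15}

def sample_bitstrings(seed, shots):
    """Draw measurement outcomes with a seeded classical pseudo-random generator."""
    rng = random.Random(seed)
    outcomes = list(PROBS)
    weights = [PROBS[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=shots)

run_a = sample_bitstrings(seed=42, shots=10)
run_b = sample_bitstrings(seed=42, shots=10)
print(run_a == run_b)  # identical seeds give identical samples
```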


A few months ago, I mentioned that it was simple, but I didn’t elaborate further because there was no context. If you tell me the level of entropy you need and if there is any aspect you want to preserve, I can easily program something. I try not to use synthetic data unless it’s strictly necessary, but I understand that in the context of space research, due to the limited amount of data, it is necessary.


Hi Lauren, thank you for sharing your results and feedback.

  1. Additional metrics: Besides PCA and correlations, you might consider metrics like Jensen-Shannon or Kullback-Leibler divergence to compare individual column distributions. These could provide a clearer picture of the similarity between synthetic and original data.
  2. Row and column limits: The current limit (250 rows and 4 columns) may impact the ability to capture complex patterns like gene correlations. We’re working on optimizing this to handle larger datasets.
  3. Column order: The API preserves the original column order. Unexpected correlations between synthetic samples might result from generating each column independently, without modeling the relationships between columns.
  4. Future improvements: We’re exploring ways to better capture inter-column correlations and optimizing the quantum generation process to preserve complex structures.
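As a starting point for the first suggestion, here is a minimal sketch of a per-column Jensen-Shannon divergence in plain Python. It bins the original and synthetic values over a shared range and compares the two histograms. The helper names (`histogram`, `js_divergence`, `column_js`) and the choice of 10 equal-width bins are assumptions for illustration. With base-2 logarithms the result lies between 0 (identical histograms) and 1 (no overlap).

```python
import math

def histogram(values, lo, hi, bins):
    """Normalised equal-width histogram of values over the range [lo, hi]."""
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        k = min(int((v - lo) / width), bins - 1)  # put the maximum in the last bin
        counts[k] += 1
    return [c / len(values) for c in counts]

def kl(p, q):
    """Kullback-Leibler divergence in bits, skipping empty bins of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by 1 bit."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def column_js(original, synthetic, bins=10):
    """JS divergence between one original column and its synthetic counterpart."""
    lo = min(min(original), min(synthetic))
    hi = max(max(original), max(synthetic))
    return js_divergence(histogram(original, lo, hi, bins),
                         histogram(synthetic, lo, hi, bins))

# Illustrative values only.
print(column_js([1.0, 2.0, 3.0, 4.0], [1.2, 2.1, 2.8, 4.0]))
```

Because each column is scored separately, this says nothing about inter-column relationships, so it complements a correlation check instead of replacing it.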

Thank you for sharing!
