The Data

So how do you take 30,000 letters and make music? Well, first all the letters had to be organized. The best way to handle a seemingly random, patternless string of letters was to divide it up by protein and then look at the frequency of different bond groupings. There are 26 different proteins that make up SARS-CoV-2, with sequences ranging from 117 letters to 5,835 letters. That is a LOT of letters to filter through before finding some pattern. To start counting the frequency of each individual letter (A, G, C, U), the main double bonds (GU, AU, UG), and all the remaining double bonds (AA, AG, AC, GG, GA, GC, UU, UA, UC, CC, CA, CG, and CU), I needed a little code and some charts to help me out.

initialize string from txt
test_str length
test_str to list
for loop to count individual chars
individual char count via hard coding
count of 3 main bond pairs
count of all other bond pairs
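
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the original code: the function name `count_sequence`, the example sequence, and the choice to count overlapping two-letter windows are all assumptions. A real run would first read the string from one protein's text file, for example with `open(path).read()`.

```python
from collections import Counter

BASES = "AGCU"
MAIN_PAIRS = ["GU", "AU", "UG"]
OTHER_PAIRS = ["AA", "AG", "AC", "GG", "GA", "GC", "UU",
               "UA", "UC", "CC", "CA", "CG", "CU"]


def count_sequence(test_str):
    """Count the length, single letters, and adjacent letter pairs of one sequence."""
    # Drop line breaks and spaces that a text file may contain, and normalize case.
    test_str = "".join(test_str.split()).upper()

    # Single-letter counts (A, G, C, U).
    letters = Counter(test_str)

    # Two-letter counts over every adjacent position (overlapping windows).
    # Assumption: "AAA" counts as two AA pairs here; str.count("AA") would give one.
    pairs = Counter(test_str[i:i + 2] for i in range(len(test_str) - 1))

    return {
        "length": len(test_str),
        "letters": {b: letters[b] for b in BASES},
        "main_pairs": {p: pairs[p] for p in MAIN_PAIRS},
        "other_pairs": {p: pairs[p] for p in OTHER_PAIRS},
    }


# Hypothetical short sequence, used only to show the shape of the output.
result = count_sequence("AUGGUUCAA")
print(result)
```

Counting overlapping windows matters mostly for repeated-letter pairs such as AA or UU; for pairs made of two different letters, both counting approaches give the same number.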

The Charts

Converting the Python code outputs into a plain data table made the strings of letters more tangible. I was getting close to a pattern. Instead of 30,000 seemingly random letters, I now had 546 numbers, and I knew I could find some association in them between protein type and base combinations. From this large data set I produced 9 variations of data visualizations. The code was starting to crack visually, which meant that step three, developing a graphical score from the data visualizations, was starting to materialize.
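
As a rough sketch of how per-protein counts could become a plain data table: one row per protein, with columns for sequence length, the 4 single letters, and all 16 two-letter pairs. One way to read the 546 figure is 26 proteins × 21 values each, but that breakdown is an assumption, since the table layout is not spelled out in the text. The helper `table_row` and the two example sequences are hypothetical.

```python
import csv
import io

BASES = "AGCU"
# All 16 ordered two-letter combinations of A, G, C, U.
PAIRS = [a + b for a in BASES for b in BASES]


def table_row(name, seq):
    """Build one table row: protein name, length, 4 letter counts, 16 pair counts."""
    row = {"protein": name, "length": len(seq)}
    for base in BASES:
        row[base] = seq.count(base)
    for pair in PAIRS:
        # Count the pair at every adjacent position (overlapping windows).
        row[pair] = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == pair)
    return row


# Hypothetical stand-ins for the real protein sequences.
sequences = {"example_1": "AUGGUUCAA", "example_2": "GUGUAUCC"}
rows = [table_row(name, seq) for name, seq in sequences.items()]

# Write the table as CSV text; a real run could write to a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
table_csv = buffer.getvalue()
print(table_csv)
```

A CSV like this opens directly in a spreadsheet, which is a convenient starting point for charting frequency by protein and by pair.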

SARS-CoV-2 Genome Map

With a little bit of Python, I extracted the frequency of each letter and bond from an individual text file for each protein's RNA sequence, using string, character, and list conversions along with for loops. Below is an example of the first RNA sequence along with the code. With this first step complete, I had the data somewhat organized for step two: finding a pattern by converting the numerical data into charts.

The Python Code

So we took the genetic code, organized the RNA sequences, converted the letters to numerical data, and charted the frequency of each bond type. Where do we go from here? All we have are some pretty graphs, some simple Python code, and a neat table of numbers. But what does this all mean? What can we do with this information aside from staring in awe at its chaotic beauty?

We have 

   DATA! 

      So..... 

          now what?
