Artificial Intelligence Aids UCSD Gene Activation Discovery

UC San Diego researchers have identified a frequently used DNA activation code.

SAN DIEGO, CA — UC San Diego researchers have identified a frequently used DNA activation code that could eventually be used to control gene activation in biotechnology and biomedical applications.

Their findings were published in Wednesday's edition of the journal Nature.

Scientists have long known that human genes are switched on by instructions encoded in the precise order of DNA's four types of individual links, or "bases," coded A, C, G and T.

Nearly 25% of human genes are known to be transcribed with the help of sequences that resemble TATAAA, called the "TATA box." How the other three-quarters are activated has remained a mystery because of the enormous number of possible DNA base sequences.

With the help of artificial intelligence, the UCSD researchers say they have identified a DNA activation code that's used at least as frequently as the TATA box in humans. The sequence, which they named the downstream core promoter region, or DPR, could help explain how another major portion of our genes is activated.

"The identification of the DPR reveals a key step in the activation of about a quarter to a third of our genes," said James T. Kadonaga, a professor in UCSD's Division of Biological Sciences and the paper's senior author. "The DPR has been an enigma — it's been controversial whether or not it even exists in humans. Fortunately, we've been able to solve this puzzle by using machine learning."

In 1996, Kadonaga and his colleagues, working with fruit flies, identified a novel gene activation sequence, corresponding to a portion of the DPR, that enables genes to be turned on in the absence of the TATA box. However, since discovering a single gene with that sequence, deciphering the details has been difficult, he said.

Kadonaga and his team made a pool of 500,000 random versions of DNA sequences and evaluated the DPR activity of each. From there, 200,000 versions were used to create a machine learning model that could accurately predict DPR activity in human DNA.

Kadonaga describes the results as "absurdly good" — so good, in fact, that the researchers created a similar machine learning model as a new way to identify TATA box sequences. They evaluated the new models with thousands of test cases in which the TATA box and DPR results were already known and found that the predictive ability was "incredible," according to Kadonaga.

Going forward, further use of artificial intelligence to analyze DNA sequence patterns could improve researchers' ability to understand and control gene activation in human cells, Kadonaga said. He added that the knowledge will likely be useful in biotechnology and the biomedical sciences.

"In the same manner that machine learning enabled us to identify the DPR, it is likely that related artificial intelligence approaches will be useful for studying other important DNA sequence motifs," he said. "A lot of things that are unexplained could now be explainable."

— City News Service