Artificial intelligence (AI) is already a fundamental tool for the advancement of science. “As a computer scientist, I believe that the human being is the most complex program ever written. It’s amazing to be able to access a piece of that code,” says Pushmeet Kohli, Vice President of Science at Google DeepMind. He knows what he’s talking about. His boss, Demis Hassabis, and his colleague John Jumper won this year’s Nobel Prize in Chemistry for their contribution to “the prediction of protein structure through the use of artificial intelligence.” The award recognizes the usefulness of AlphaFold2, the tool that has described the three-dimensional shape of 200 million proteins, information that is key to understanding how organisms function.
Kohli supervised the team that wrote the code for AlphaFold2. He directs some 150 researchers who make up the most purely scientific arm of DeepMind, the division of Google that, by its own account, operates outside the commercial needs of its parent company and does not take part in the generative AI race. The 43-year-old machine learning and computer vision expert grew up in the foothills of the Himalayas in Dehradun, India, and moved to the United Kingdom to complete his studies. After finishing his doctorate at the University of Cambridge, he was hired by Microsoft, where he became research director. In 2017, Hassabis asked him to take charge of DeepMind’s scientific projects.
For Kohli, AI has opened a new horizon. “In any area of science you look at, AI is transforming what can be done,” he tells Morning Express after speaking at the AI for Science forum, organized in London by his company and the Royal Society.
Q. Is there any scientific discipline that cannot benefit from the boost of AI?
A. If you can formulate the scientific question you are working on as a reasoning problem or as a pattern recognition problem, where certain conclusions have to be drawn from the data, then AI has a lot to contribute. A common mistake is forgetting that you have to be able to capture data from the physical objects you are studying. For example, it makes no sense to build models that try to predict emotions, because the data you will train them with are subjective reactions from humans who have seen this or that facial expression or body language in certain contexts. It is very important for us to know the limitations of the models.
Q. What type of projects are you interested in?
A. We have a lot of work around biology. We have touched on structural biology with AlphaFold, but we are also very interested in genomics: we want to understand the semantics of DNA, to know what happens with the problem of variants of unknown significance. That is our next challenge. If there is a mutation in the genome, what specific effect does it have? We are also working on new materials, where we believe there is a lot of potential to advance. Other important areas for us are nuclear fusion, climate and basic science related to mathematics and computer science.
“Congratulations to John, the #AlphaFold team, and everyone at DeepMind & Google that supported us along the way – it’s an amazing award for all of us! It’s such an honor and privilege to work with all of you to advance the frontiers of science.” – @DemisHassabis
— Google DeepMind (@GoogleDeepMind) October 10, 2024
Q. What objectives have you set in the areas of fusion and new materials?
A. In nuclear fusion, the goal is to maximize the time we can keep the plasma stable. When the fusion reactor is turned on, our AI system controls the magnetic field, which has to be subtly modulated to avoid disruptions that destabilize the plasma while maintaining the appropriate temperature and friction. In materials development, the goal is to design new materials that prove synthesizable and stable when tested in the laboratory.
Q. You say that, in the area of genomics, the goal is to understand the semantics of DNA. How far along are you in that process?
A. The Human Genome Project read the 3 billion characters of the code that makes us who we are. It turns out that all those letters have a meaning, a purpose, that we currently don’t fully understand. The genome has two components: the coding part and the non-coding part. The first determines which proteins will be expressed; the second, the regulatory mechanisms that dictate how much of each protein should be expressed, and so on. For the coding part, we are already making predictions with a high level of reliability. We think we are close to being able to say whether certain mutations are going to be problematic or not. But knowing how and why they will be problematic is still an open area of research. The same applies to the non-coding part: we want to know how protein expression happens. There is no timeline for finishing the project right now. But when we do, we will truly have an understanding of the language of life. And then we can start thinking about how to edit the genome to achieve certain goals.
Q. To what extent has the race for generative AI, which in Google’s case is led by Gemini, distracted from the company’s other lines of research?
A. Generative AI is a very powerful concept, also for science, because it has unlocked something new. Until now, much of our effort was focused on leveraging structured data, in the sense that you had a sequence and a prediction and could see the results in tabular form. But many scientific advances are contained in articles, in text form, so we were not able to apply AI to them and take advantage of the kind of intuition they provide. Large language models have allowed us to extract knowledge from that scientific literature. So, in a sense, generative AI is helping science because it opens up a new field.
Q. Generative AI relies on giant databases, which have already exhausted the entire internet. It is beginning to be said that the next models will be trained with synthetic data, created by machines. How do you see this?
A. I think the larger a model is, the more expressive it is and the greater its degrees of freedom. With more data, we can have more oversight and control over what the system is going to learn. But it is not a question of size; what really matters is the diversity of the data, that it presents the model with different types of problems from which to extract intuitions.
Q. Does synthetic data achieve that?
A. It is not something that works in all cases. Typically, we use data obtained by performing experiments. AlphaFold, for example, was trained on a database of 150,000 proteins and, after training, was able to predict the structure of more than 200 million. In some cases, we use simulations. That is what we do in our work on nuclear fusion: we explore the possible ways plasma can behave to see how to control it, with the idea that, when applied to the real world in a reactor, the system will be able to generalize. And finally, there is synthetic data, generated by AI. In some cases, you can have the model produce certain types of data that were not present in the training database. For example, imagine that the original database only contains images of green chairs; in a synthetic database, since we know the concepts of blue and red, we can generate chairs of many different colors. The final model will then understand that chairs come in various colors and be able to detect them.
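Kohli’s chair example corresponds to what machine-learning practitioners call data augmentation. A minimal sketch in Python, with entirely hypothetical data and names, of how synthetic variants can expand the diversity of a training set along a dimension the original data never varies:

```python
# Hypothetical sketch of synthetic data augmentation, following the
# chair-color example: the original data contains only green chairs,
# and we generate synthetic variants in colors the generator "knows".

ORIGINAL_DATA = [
    {"object": "chair", "color": "green"},
    {"object": "chair", "color": "green"},
]

KNOWN_COLORS = ["green", "blue", "red"]  # concepts the generator understands


def augment(records, colors):
    """Create one synthetic copy of each record per known color
    other than the color the record already has."""
    synthetic = []
    for record in records:
        for color in colors:
            if color != record["color"]:
                synthetic.append({**record, "color": color})
    return synthetic


training_set = ORIGINAL_DATA + augment(ORIGINAL_DATA, KNOWN_COLORS)

# The augmented set now covers chairs of several colors, not just green.
print(sorted({r["color"] for r in training_set}))
```

The point of the sketch is the caveat Kohli raises next: the synthetic records are only as good as the generator’s grasp of the concepts (here, the colors) being varied.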
Q. What kind of problems can be solved with these types of models?
A. This can be applied to almost any problem we can imagine, but it may not work for all of them. We still don’t have a theory of when synthetic data is useful. But in some cases we have verified that this technique improves system performance.
Q. What proportion of synthetic data do you use?
A. We are investing in those three types of data sources, especially in simulation, which is the most effective and can be controlled. We are using synthetic data, but with caution: it is very important that the original model is good, because otherwise the result is useless.