Open Source Drug Discovery: The GSK Arylpyrrole Series

Several labs are leading an open source drug discovery project for malaria, including the Todd lab at the University of Sydney, the Medicines for Malaria Venture and GSK Tres Cantos, but the project requires many other partners. The aim is to prosecute a hit-to-lead campaign starting from known actives in the GSK data set. Some background is here. The current project status is kept up to date on this wiki page.

Discussion: this site (daughter pages below)

Data: The electronic lab notebook is here.

Project status.

Updates: via a Twitter feed, and a Google+ page.


Project is open source - if you're reading this and would like to participate, you can.




A list of what's needed in the OSDD Malaria project

Main: Comment/Analysis on Initial Bioactivity Data

We have some excellent first results. What to do next?


Resynthesis of TCMDC-123812 and TCMDC-123794 (ELN)

Need: advice on oxidation of pyrrole-3-carbaldehydes - here. Through a work-around, these syntheses are complete, but we could still use advice on the step above. Done


General Analog Synthesis Planning

Need: Advice from med chemists on what to alter first - here.


Biological Evaluation of Initial Leads

Need: Advice on what kinds of biological evaluation are most desirable to validate the initial GSK leads - here.


Where Else Can we Access This Series?

Need: People with stocks of analogous compounds (i.e. members of the arylpyrrole series) to submit those compounds for screening. First possibility seen here.

We're compiling a list on OpenWetWare of compounds we would like to source.

Desired Compounds Consultation


Request for Help

We're looking to identify a new set of compounds for the next round of optimisation. This is happening in addition to sourcing of commercially available analogues that will fill a bit more of the SAR space but aren't necessarily exactly what we want.

We've posted a list of compounds on OpenWetWare that contains a range of the things we're after, along with a SMARTS filter summary. We've had feedback and ideas submitted before here on TSL and some of those ideas have been included in this list. Below I've posted the 10 "priority" compounds from the list. It's very much open to debate so get involved. Once we've had a bit of feedback, we'll settle on a definitive list and go after them by any means necessary. The plan was to mostly concentrate on the side-chain, leaving the aryl pyrrole unit mostly untouched for the moment.


[Edit 0905 AEST 15 June 2012: Added letter identifiers for compounds to aid discussion.]

Desired Compounds Consultation Phase 2


Request for Help

The evaluation of the arylpyrroles has gone well, in that we've identified promising new antimalarial compounds. Besides their high potency, they exhibit high levels of activity in a late-stage gametocyte assay which is very exciting. (As an open source project, anyone may take these results and work on them - made easy by all our data being available.) It's for these reasons of potency that we're going to explore one more iteration of the series, despite three of the compounds showing no oral activity in mice. It's thought the problem could be low solubility. This round will only be including compounds with low (<5) logP, and we'd like to play around with the structure a little more.

If this round does not throw out any improved compounds we'll probably park the series. Hence it's important that we choose a good set of compounds to evaluate. We decided to list the top 10 "most wanted" compounds that we could access commercially, as well as a similar list of compounds we could not buy and wanted to make. We'd then attempt to source those commercial compounds, and ask the synthesis community to volunteer to make the other necessary compounds.

We're now assembling the lists. We'd compiled a first-pass list of attractive compounds. We've now modified that list to give two new lists of commercial vs. synthetic compounds - below. We now need to consult the community again on these new lists. Before embarking on synthesis or purchase we will have the compounds checked by the original authors of the GSK TCAMS set to see whether any of the compounds have been evaluated and found to be inactive - we'll send the SMILES of all these compounds to GSK and see what they say. We know that's a big ask.

First the compounds we'd like to get our hands on which are commercially-available:

If any of these are known by GSK, we'll fill up the spaces with compounds from these backups, or any others people might like to see tested:

The compounds we'd like to evaluate which are not commercially-available are:

(Note that primary, secondary and tertiary terminal amides are all of interest here and ought to be made concurrently.)

And again, these are the backups in case these compounds have already been evaluated:

In the synthesis set:

1) We've included a couple of pyrazoles. Fused pyrazoles quite different to the GSK hit compounds are commercially-available, and we've not included those because of the substantial differences - those options are shown here, and we could include some if needed.
2) We've taken the curveballs out as being a little speculative, but if anyone knows how to make these, or wants to have a go, please say.
3) We've de-prioritised the thiazolidinones as being too insoluble, even with some obvious tweaks. This means we have three slots available for the synthesis "top 10 most-wanted." Our final phase of consultation will be to fill those slots.

The final consultation will hopefully be a public hangout on the web for a final discussion, technology permitting. Date to be advised. This will finish the "Most Wanted" lists and begin the next phase of compound evaluation. So this is where we stand - would anyone do this differently?

On a side note that will probably need a post of its own: the logPs above are approximate. We've been using available tools to calculate these, e.g. ChemDraw, but there's a lot of variability depending on the tool used. We have no access here to the one from ACDLabs, which performs well. While it's likely the above figures are inaccurate (vs. truth), it's unlikely they are so far out as to invalidate a target.
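To make the point about tool-to-tool variability concrete, here is a minimal sketch of how calculated logPs from several tools could be compared against the <5 cutoff used for this round. The compound names, tool names and numbers are placeholders invented for illustration, not project data, and the "tool-dependent" label is our own assumption about how to flag borderline cases.

```python
from statistics import mean

LOGP_CUTOFF = 5.0  # the "<5" threshold stated for this round


def summarise_logp(estimates):
    """Return (mean, spread, verdict) for one compound's logP estimates.

    estimates: dict mapping a tool name to its calculated logP.
    verdict is "pass" if every tool is below the cutoff, "fail" if none
    is, and "tool-dependent" if the tools disagree.
    """
    values = list(estimates.values())
    below = [v < LOGP_CUTOFF for v in values]
    if all(below):
        verdict = "pass"
    elif not any(below):
        verdict = "fail"
    else:
        verdict = "tool-dependent"
    return mean(values), max(values) - min(values), verdict


# Placeholder values only (hypothetical compounds and tools).
candidates = {
    "compound_A": {"tool_1": 3.2, "tool_2": 3.9, "tool_3": 3.5},
    "compound_B": {"tool_1": 4.6, "tool_2": 5.4, "tool_3": 4.9},
    "compound_C": {"tool_1": 5.8, "tool_2": 6.3, "tool_3": 6.1},
}

for name, estimates in candidates.items():
    avg, spread, verdict = summarise_logp(estimates)
    print(f"{name}: mean logP {avg:.2f}, spread {spread:.2f}, {verdict}")
```

Compounds that come out as "tool-dependent" are the ones where a better logP calculation (or a measured value) would actually change the decision.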

Consultation Outcome



Firstly, thanks for coming to the online meeting. I found it worked quite well (minor glitches aside) but it would be great to hear your thoughts. We'll post the recording up in the near-ish future. The OpenWetWare wiki will soon be updated to reflect the outcomes of the meeting, along with SMILES. Of course if anyone is keen and beats me to it, then even better.

The discussion focused really on the selection of synthetic compounds. The list of commercial compounds (below) remained the same. The project is now looking for willing donors (ca. 5 mg) for these compounds. 


The list of synthetic compounds saw some changes, partly because some compounds on the original list have already been made. Replacements for these were discussed and found, and the blank spaces were filled. These compounds look to mitigate the problems observed with the previous rounds of testing. Please let us know if these structures aren't what you were expecting to see here. The two compounds highlighted in blue are ones that are currently receiving attention here at Sydney.

Final top 10 synthetic targets


The project now needs synthetic teams to investigate the other targets. Academic groups looking for a collaboration, and contributions from industry, would be gratefully received. It could potentially be a good way for a CRO to showcase their expertise in turning out compounds. We would be willing to provide starting material if necessary.


Analog Synthesis - Variation of the Aryl Ring

Open source is most powerful when people participate by creating. Open science is no different, and in the case of lab-based sciences, that means actually doing experiments. For the open source drug discovery for malaria project we need people to make molecules. In fact a lot of people need to make molecules. We have our first offer (November 2011).
Sanjay Batra at the Medicinal and Process Chemistry Division of the Central Drug Research Institute in Lucknow, India, has offered to ask a student to make some molecules as part of the current push to validate the GSK aryl pyrroles (thanks to Saman Habib for putting us in touch by email - Saman is going to be leading the Indian OSDD Malaria project that is starting in 2012). This opening post describes where we are, and what I think needs to be made next (though the post may change over time as the project changes).

Below are the compounds sent last week (Nov 24 2011) for biological evaluation. Included are the original TCAMS compounds, some “near neighbour” compounds, and a range of pro-drug possibilities (i.e., if the TCAMS compounds are actually prodrugs, given that that ester is unlikely to survive for long). Most were made by Paul Ylioja, and some by the undergraduate student Paul was mentoring, Laura White, who posted a nice report of what she did here. The compound codes will allow you to find the procedures in the ELNs.


According to wisdom received from the GSK Tres Cantos and MMV guys, we should be doing a broad and shallow SAR search, which is to say we ought to be picking several points of variation in the TCAMS structures and making a small number of changes in each position rather than exhaustively changing one position. The rationale there is that we need to see that the potency varies when we change things, otherwise there are a bunch of other hit series we can look at.
I think that means the best thing for Sanjay's lab to do is to finish off making variations in the aniline of the arylpyrrole synthesis – i.e. vary the fluorine:

We've done some of this – converting the F to H, Me and CF3. Not all of these have been taken all the way through to the end yet. Our undergrad student Zoe is working with Paul Ylioja to make a 3,5-CF3 variant. But I think it's important we change the position of the F, that we change the Ph ring to, for example, a pyridine, and we bulk out the ring with something like two methyls. I also think the biphenyl would be a good one to try (i.e., use 4-phenylaniline). Which compounds are made depends on which starting materials are available. I think we need 3-4 diverse anilines taken through to the end, so 8 final compounds. Whether intermediates should be saved for screening depends on whether the "prodrug" compounds sent for evaluation above look promising.

The pyrrole esters would then need to be hydrolysed, and coupled with the TCAMS R' groups according to procedures Paul has nicely worked out. Typical procedures are given as red URLs above, but generally the chemistry can be browsed at the ELN.
It would help a great deal if Sanjay was able to use the same lab book that we are using, i.e., to start an account on Labtrove and start a separate blog on this page called something like “CDRI Synthesis of Aniline Variants” where the experiments would be posted (we can create this if it's not easy/obvious). Crucially, this is an open science project, so all data must be deposited – check out the Six Laws. Our labs will be geographically separated, so we must have full access to each others' data. This also means that readers of the project can have faith in what we're doing because they can check the raw data.
I hope this project idea sounds good as a starter, Sanjay, and your students are happy!
Biological evaluation of compounds would happen either here in Australia, or better in India, if we are able to establish a willing venue for that. I suspect that will be no problem, but is not currently sorted out. The assays for the compounds made need to be similar to the ones being done elsewhere, and the same control compounds should be used. The controls should probably include one of the original TCAMS hits, and we could provide that compound if and when it's needed.

Note if you're reading this and want to take part by making some molecules, please say. You're both welcome and needed, provided you subscribe to the Six Laws. There's so much to do, we can't do it all on our own. Similarly, if you're a medicinal chemist who just can't help themselves, and think we're approaching this all wrong/right, please feel free to say why. Discussion can happen here. Project status will be most up to date on the wiki. You can tweet the project. Or you can catch up with some of us on Google+, which is a pretty useful addition to the project tools and gets us away from private email, which is generally useless for an open project.


Availability of TCMDC-123812 and TCMDC-123794

We're starting open source drug discovery for malaria. We have to start somewhere: in this case a couple of known compounds that showed good activity and have plenty of possibilities for modification - compounds contained in the open deposition of malaria data from 2010, originating from GSK's Tres Cantos lab.

Before getting too excited about these leads, we must validate them, meaning we need to obtain samples and screen. Paul Ylioja is currently making these compounds, and the chemistry is going very well, helped in part by his pyrrole wizardry. Please analyze and comment on the lab book, particularly if you're a synthetic organic chemist.

SciFinder and Google searches on the SMILES/InChIs for these structures throw up very little, but it turns out they are commercially available from a number of suppliers (Paul first spotted this). We corresponded with Felix Calderon at GSK Tres Cantos, who said that, indeed, these compounds had been bought in from the Enamine library. Tres Cantos have stock of these compounds in Madrid, and have kindly offered to look into them further if needed. Potential evaluation will be dealt with in another post elsewhere.
Given we will be wanting to modify the structures, we need to be able to synthesize them rather than buy them.

But I wonder why these compounds were made in the first place?

Biological Evaluation of Arylpyrrole Series

New Stuff:
Online lab book hosting bioactivity data is here.
First set of compounds have been evaluated (Jan 2012) - here.
Initial phase of this project is to validate the biological activity of the two Tres Cantos leads. The promise of these compounds (and others) is discussed in a paper linked here.

Original activity data for the two compounds are here and here.

For this initial phase, the question is: What kind of biological (re)evaluation is needed? (not toxicology, just activity)

For experiments, Tres Cantos (Felix Calderon) kindly offered to re-evaluate these compounds. We also have links with other labs who have expressed an interest in this project (the Eskitis Institute in Queensland or Stuart Ralph's lab in Melbourne). Question is, what data are we looking for?

In our original proposal for this project, we assumed the following assays would be needed in general. Are all these needed for validation of the current two compounds, or only later during analog evaluation?

1) A primary whole cell parasite assay covering a sensitive and resistant falciparum strain (3D7, Dd2 and W2mef). (Screening for activity would use an image based anti-malarial HTS assay incorporating DAPI or SYBR-Green dyes to monitor parasite growth: asexual and, potentially, gametocytes.)

2) Assay for information on the selectivity between drug resistant and sensitive falciparum strains, as well as possible cytotoxicity on mammalian cell lines (typically HepG2 or HEK293 cells), to check for a high therapeutic ratio.

3) For compounds that inhibit growth selectively, IC50s should be determined using serial dilutions of inhibitor in 48 and 96 h assays, which will allow us to screen for promising cell-permeable inhibitors and to discern immediate and delayed parasite death – suggesting whether inhibition is of cytosolic- or apicoplast-based targets.
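As a rough illustration of items 2 and 3, the sketch below estimates an IC50 from a serial dilution by interpolating on a log concentration axis, then computes a selectivity ratio against a mammalian cell line and the shift between 48 h and 96 h readings. This is a simplification we are assuming for illustration (a screening lab would normally fit a full dose-response curve), and every number in it is a placeholder, not project data.

```python
import math


def ic50_by_interpolation(concentrations_nm, percent_inhibition):
    """Estimate IC50 by linear interpolation on a log10 concentration axis.

    Both lists must be the same length and sorted by increasing
    concentration. Returns None if the data never cross 50% inhibition.
    """
    points = list(zip(concentrations_nm, percent_inhibition))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        if y_lo <= 50.0 <= y_hi and y_hi > y_lo:
            fraction = (50.0 - y_lo) / (y_hi - y_lo)
            log_lo, log_hi = math.log10(c_lo), math.log10(c_hi)
            return 10 ** (log_lo + fraction * (log_hi - log_lo))
    return None


def selectivity_index(ic50_mammalian_nm, ic50_parasite_nm):
    """Ratio of mammalian-cell IC50 to parasite IC50 (higher is better)."""
    return ic50_mammalian_nm / ic50_parasite_nm


# Placeholder dilution series (hypothetical values).
concentrations_nm = [1, 10, 100, 1000]
inhibition_48h = [5, 30, 70, 95]
inhibition_96h = [20, 50, 90, 99]

ic50_48h = ic50_by_interpolation(concentrations_nm, inhibition_48h)
ic50_96h = ic50_by_interpolation(concentrations_nm, inhibition_96h)
print(f"IC50 at 48 h: {ic50_48h:.1f} nM")
print(f"IC50 at 96 h: {ic50_96h:.1f} nM")
print(f"48 h / 96 h fold shift: {ic50_48h / ic50_96h:.1f}")
# 20000 nM is a made-up mammalian cytotoxicity IC50 for the example.
print(f"Selectivity index: {selectivity_index(20000, ic50_48h):.0f}")
```

A large 48 h / 96 h fold shift is the kind of signal that would point towards delayed death, though where to draw that line is a question for the people running the assays.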

In our correspondence with Felix, he said the following:

1) The antimalarial activity of these compounds is not affected by the presence or absence of folate in the culture medium, implying they are not inhibitors of the folate biosynthesis pathway. Is this of general significance since it steers clear of well-established resistance mechanisms? (review)

2) The compounds are neither bc1 nor DHODH inhibitors. Why is this important?

3) Felix would be happy to determine the IC50 for these compounds in the standard hypoxanthine incorporation assay (48 h). Determination in the original Tres Cantos dataset was measured at 72 h using the LDH assays. Is this difference in assay significant/desirable?

These questions are intentionally naive, because though there are many options, we need a consensus on what people will be looking for in validation of the existing compounds, and why.

Biological Results for First Set of Compounds

The screening data from three separate labs have been obtained for the first set of compounds on the project. Data were obtained from the Ralph Lab at the University of Melbourne, and a second data set was provided just before Christmas by the Avery Lab at Griffith University. Yesterday the third set was provided by GSK Tres Cantos in Spain, who originally discovered the hits we're starting with. The current list of available compounds in this open project is here, with those that have been evaluated by at least one lab indicated in the relevant column.

Having data on the same compounds from three labs using different screening methods is useful as it provides contrasting ways of assaying effectiveness. In any given screening experiment on this project it's going to be important to include known actives, so that we have benchmarks, and this was done in these cases. It's also very important to be 100% sure about the effectiveness of a compound before we become too attached to it...

The data (below, but all available through the relevant lab book) show that the original TCAMS compounds are certainly active, though perhaps not quite as active as suggested by the original screen. Paul Willis at MMV had suggested we also check out some "near neighbors" of these compounds that were in the original data set. We made a couple and one (a novel compound with the code PMY 14-1, shown below and synthesized here) has shown promising activity in all three screens, with Avery/GSK IC50s coming back as low nanomolar. (Note that this project will never involve patents or closed data, giving us the freedom to discuss the compounds freely.)

What's next? In the short term: We're waiting for confirmation of the Melbourne data via a re-run of some of the experiments. But what we need is an expert qualitative assessment of these bioactivity data by someone familiar with such screening assays. Either in comments below this post, or on G+, not by email. First item of business in the lab is to generate a few variants of PMY 14-1. We already have some new relevant compounds and are now planning others. What should we make - i.e. how ought we to change PMY 14-1? Sanjay Batra has students who are about to make steric variations in the aryl pyrrole, and these could then be employed in the synthesis of PMY 14-1 variants, for example, but shouldn't we also be interested in changes in the "upper half" of the molecule?

In the long term: It would be good to find other labs which already have analogous compounds to the actives. Paul and Zoe found a paper from the Roberts lab at Scripps describing a number of such compounds, and I will write to them to ask whether they are interested in having the compounds be screened for their antimalarial activity. If anyone knows of any other possible sources, that would be great, since using existing compounds saves a lot of time in the lab.


Biological Results for Second Set of Compounds

In January the first biological data for compounds from the open source drug discovery for malaria project came through. The compounds were based on two hits identified in the GSK Tres Cantos set (TCMDC-123812 and TCMDC-123794). The two originals performed well, and we also identified two other compounds that looked promising (PMY 14-1 and PMY 14-3-A). Biological data were obtained from three labs (the original GSK lab, Stuart Ralph's lab in Melbourne and Vicky Avery's lab in Brisbane) and compared to known antimalarials.

Since then Paul and Zoe have been making a second set of compounds, which we shipped last month. Details of those compounds are in this spreadsheet. They are intended to explore the most promising compounds from the first set.

The first biological data are now back - from Vicky Avery's lab. We have some super-potent compounds, which is very exciting. One is picomolar (the data below are the average of two runs on 3D7). The data are posted raw here, and are summarized below (direct link to picture file).

A few obvious points:
1) We're eagerly awaiting the data from the other two labs, to see if the activity is confirmed.
2) The SAR isn't flat - i.e. changes to the structure of the molecules make a difference to the bioactivity.
3) The aryl pyrrole appears to be needed in all sets.
4) Replacement of the ester with an amide in the original GSK compounds is seriously deleterious.

What's needed:
1) The most potent compounds have high logP. We're going to need to make them more aqueous soluble.
2) The best four from the first round and four from this second round are going to be shipped for basic metabolism assays to Sue Charman at Monash.
3) We're hoping to send 2-3 compounds for in vivo evaluation. Possibly the two originals, plus one of the super-potent compounds. Awaiting confirmation that we can do that.
4) The work that Sanjay Batra at CDRI is doing on installing sterically demanding groups on the aryl ring in place of F will be an important addition here.

1) What do we do to decrease logP?
2) There have already been some good suggestions on how to change these compounds by modification of/introduction of other heterocycles. We think this is still the way to go for round three. Everyone agree?
3) Does the lack of activity for compound ZYH 23-1, and its laughable lack of reactivity towards hydride reduction, suggest we need not worry about these compounds being PAINS?

If you've any other gut feelings about these compounds, or if you'd like to play with them in your lab, or if you spot some chemistry you'd like to do to make a related scaffold, please say.

To re-state the obvious: this is open source, meaning you can join the project, or take what we've done and use it in your own research, with attribution (CC-BY-3.0).

Biological Results for Third Set of Compounds



A summary of the biological activities obtained for the third set of compounds in 2012 - those arising from the consultation on which synthetic and commercial compounds were most wanted. See also some links for analysis of trends in the data.

First set (Oct 19th) Data, and these were discussed briefly in an online meeting.

Second set (Nov 8th) Data, and discussion

Third set (Dec 10) Data (essentially inactive, aside from mild activity for OSM-S-103)


Previous discussion of these data, highlighting trends.

Importance of primary amide side chain.

Impact of replacing ester with amides and amines. And impact on the two original GSK compounds.

Dramatic impact of methylation of the hit compound.

Low efficacy of pyrazoles.

Prodrug hypothesis II and III

Reminder of the efficacy of the near neighbour thiazolidinones.

Suggestion of next compounds, including hybrids, and the WANTED! compounds.



Late Stage Gametocyte Assay for Arylpyrroles



Four of the arylpyrroles/near neighbors have been tested in a late stage anti-gametocyte imaging assay, with interesting results.

This assay is less usual than other malarial assays (because it is technically more challenging). See this paper for a clear description of the importance of the gametocyte stage of malaria. The upshot is that drugs targeting this form of the parasite (the late stage gametocyte) are particularly valuable because they could help prevent the transmission of malaria.

In fact in the above paper many antimalarial compounds did not display activity against LSG, with only methylene blue reaching an IC50 of 12 nM. Indeed more generally it seems that there are few compounds that have been identified with this activity.

Interestingly the novel compounds screened from the arylpyrrole set (but not the original GSK compound OSM-S-5) were highly potent (nanomolar) in this assay. The data are posted in the lab notebook. That's pretty interesting.


Metabolic Studies on a Set of Arylpyrroles



Eight compounds - two GSK originals and six promising-looking compounds made during this project - were examined by Sue Charman's lab at Monash for metabolic degradation in vivo. In human terms that means they were tested (on a simplistic level) to see whether the compounds would last long in the blood or whether they would likely be metabolised. As part of these studies the solubilities of the compounds were evaluated.

Compounds were as follows. Note the use of the new "OSM-S" notation which we're introducing to give a compound a unique ID tag, independent of source, though there will probably always be trailing synthetic-prep tags.

The results are posted here.

As is often the case low degradation rates (good) come at the cost of low solubility. We kind of expected this based on the high logP values for some of these compounds. Presumably analogous compounds could be re-examined when more soluble.

hERG Assay for Two Potent OSM-S Compounds



One of the most potent compounds identified to date in the OSDD malaria project, the near-neighbour analog OSM-S-35 (ZYH3), was subjected to the hERG assay along with one of the original TCAMS GSK compounds, which in this project has the tag OSM-S-5.

Raw data are here, plus spreadsheet here.

These results suggest that these compounds are "misses" in this assay, implying that they, and perhaps the series as a whole, would not have cardiac side effects as drugs.

Courtesy of Paul Willis at MMV: The human ether-a-go-go related gene (hERG) encodes a potassium channel in the heart (IKr) which is involved in cardiac repolarisation. Inhibition of the hERG channel can cause ‘QT interval prolongation’ resulting in a potentially fatal ventricular tachyarrhythmia called Torsade de Pointes. A number of drugs have been withdrawn from either the market or from late stage clinical trials due to these cardiotoxic effects, therefore it is important to identify hERG inhibitors early in drug discovery. Therefore the fact that these compounds are inactive at hERG is good news (Much is understood about the pharmacophores that hit the hERG channel so I was not expecting an issue for these compounds but it is always good to confirm).  A hERG inhibition at an early stage is not a show stopper but a clear issue that has to be addressed in the optimization process.




Druggability of the Arylpyrrole leads (TCMDC-123794 etc)

Project is starting with a couple of leads from the GSK Tres Cantos set. There is a newly-published analysis of the druggability of the compounds in the original data set. The arylpyrrole series is listed as one of the most promising (though the Aryl-F is missing in the published paper - presume that is a clerical error).
Tres Cantos are appealing for collaborators to work with them on these compounds, which is an excellent idea. That's what we're doing, except that the project hosted here is open source, meaning anyone can see what we're doing and guide the direction of the project.
The initial phase is the resynthesis of these leads and their validation. We will soon be moving to analog synthesis. The obvious first question for the community of medicinal chemists is: what should we change?
My gut feeling was to verify the need for the aryl-F and the methyls on the pyrrole. Paul Willis' gut feeling was the ester. Gut feelings and half-formed thoughts enormously welcome as comments below.

Known Near Neighbours of Initial Tres Cantos Leads

We're starting with the resynthesis of two leads from the GSK Tres Cantos dataset. The obvious question is: are there other, related structures in the dataset that might give us information on what to change next?


Paul Willis from MMV did a quick "near neighbour" search (25 Aug 2011), particularly with an eye to getting rid of the ester in the lead structures. Structures below. As he said: "The first compound is a ketone analogue of the ester lineage – it's a singleton and not an ideal group from a drug discovery perspective but indicates other groups may be tolerated.  The next set is the entire cluster output of another near neighbor I spotted – all have replacements for the ester group (again not especially drug like but possible indication that wide variation possible at this position) and interestingly some contain variations on the 4-F-Ph"






1) Are there other structures that are "similar" in the GSK set (searchable at ChEMBL)?

2) What do these structures tell us about what to change next in the lead compounds?

3) Importantly, can we gain access to analogous structures that have been evaluated but not reported?

Interestingly this includes compounds that are related to the TC hits above but which were perhaps assayed against other targets.

Data for the above compounds will be posted to the wiki page. Discussion of them can more easily happen in comments below.

Synthesis Strategy for Near Neighbours

The evaluation of a synthesis strategy toward "near neighbours" of the TCMDC-123812 and -123794 is underway. Current efforts summarised as shown below:
Near Neighbour Synthesis
The experimental details are outlined in experiments PMY 13-1, PMY 14-1 and PMY 16-1. The synthesis appears straightforward but the final hydrolysis step resulted in material that was contaminated by grease (presumably from lab glassware). I'll update when I've repeated the reaction, while avoiding introduction of grease!

Prediction of Biological Targets of Actives


Request for Help

One of the interesting features of the GSK set of antimalarial compounds that are acting as the starting point for this project is that they are whole-cell actives, meaning that though they are extremely promising hits, we don't know how they work - i.e. what the targets are. To some extent this doesn't matter - praziquantel has been used for over 30 years and nobody knows how it works. However, a combination of factors (ease of regulatory approval, the possibility of some rational drug design, sheer curiosity) means it would be nice to know what these antimalarials are actually doing. How to figure that out?

There are ways. One is to use predictive cheminformatics - to build a model from all the known drug vs. drug target matches, and to extrapolate that model to a molecule of interest. This exact part of the open malaria project was in our original grant proposal as something with which the core team had no expertise and so was an area where we were going to have to appeal for help. One of the super nice extra features that such an approach can bring is to predict off-target effects, which can help make a drug more effective (for example in this tremendous paper).

Last week such help arrived. I was talking with John Overington and Iain Wallace from ChEMBL about uploading the data from our project to their database (about which more shortly). It was an extremely interesting conversation. Iain has an interest in target prediction. He'd already taken the most active compound from our last round of biological evaluation and run it through his system to predict the likely biological targets of the drug. The raw data are here. The outcome of this search was the following list of possible targets:
1. Carboxy-terminal domain RNA polymerase II polypeptide A small phosphatase 1
2. Dihydroorotate dehydrogenase (DHODH) - MMV/GSK have run these assays, e.g. here.
3. SUMO-activating enzyme subunit 2
4. SUMO-activating enzyme subunit 1
5. Cyclin-dependent kinase 1

What can we do with this information? We can try to find someone willing to screen this compound against those targets directly, to see if they are really targets. Anyone running these assays?

Iain's method is described in the online lab book, but he says it's this in essence: "Basically, a naive bayes model is built to distinguish compounds that are known to bind a particular target in ChEMBL from all others. We repeat this procedure for ~1,300 targets creating a model for each and score a compound with each model. I then generate the reports for only malaria proteins."
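
To make the idea in Iain's description concrete, here is a minimal sketch of a one-model-per-target naive Bayes scorer. It is not Iain's implementation: the toy fingerprints (sets of "on" bit indices), the target names, the Laplace smoothing parameter `alpha` and all function names are assumptions made purely for illustration. A real run would use proper chemical fingerprints and the known binders recorded in ChEMBL.

```python
import math

def train_target_model(fingerprints, is_binder, n_bits, alpha=1.0):
    """Fit a Bernoulli naive Bayes model for ONE target (binders vs. all others).

    fingerprints: list of sets of 'on' bit indices (toy stand-ins for real fingerprints)
    is_binder:    list of bools, True if the compound is a known binder of this target
    alpha:        Laplace smoothing, so no probability is ever exactly 0 or 1
    """
    counts = {True: [0] * n_bits, False: [0] * n_bits}
    totals = {True: 0, False: 0}
    for fp, label in zip(fingerprints, is_binder):
        totals[label] += 1
        for bit in fp:
            counts[label][bit] += 1
    n = len(fingerprints)
    model = {"n_bits": n_bits, "log_prior": {}, "p_on": {}}
    for label in (True, False):
        model["log_prior"][label] = math.log((totals[label] + alpha) / (n + 2 * alpha))
        model["p_on"][label] = [(c + alpha) / (totals[label] + 2 * alpha)
                                for c in counts[label]]
    return model

def score(model, fp):
    """Log-odds that a compound binds the target (higher means more likely)."""
    log_odds = model["log_prior"][True] - model["log_prior"][False]
    for bit in range(model["n_bits"]):
        p_binder = model["p_on"][True][bit]
        p_other = model["p_on"][False][bit]
        if bit in fp:
            log_odds += math.log(p_binder / p_other)
        else:
            log_odds += math.log((1 - p_binder) / (1 - p_other))
    return log_odds

def rank_targets(models, fp):
    """Score a compound with every target model and sort, best first."""
    return sorted(((score(m, fp), name) for name, m in models.items()), reverse=True)

# Hypothetical toy data: six compounds, six fingerprint bits, two made-up targets.
N_BITS = 6
library = [{0, 1, 2}, {0, 1}, {1, 2, 3}, {3, 4, 5}, {4, 5}, {2, 4, 5}]
known_binders = {"target_A": {0, 1, 2}, "target_B": {3, 4, 5}}  # indices into library

models = {
    name: train_target_model(library, [i in idx for i in range(len(library))], N_BITS)
    for name, idx in known_binders.items()
}

query = {0, 1, 3}  # fingerprint of the compound of interest
ranking = rank_targets(models, query)
for log_odds, name in ranking:
    print(f"{name}: {log_odds:+.2f}")
```

The design follows the quoted description: each target gets its own binders-versus-everything-else model, and a compound is scored against every model in turn. Restricting the final report to malaria proteins would simply be a filter on the ranked list.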

Iain has also repeated this analysis for the entire malaria box. This is significantly awesome work. Does this, I wonder, change the perception of which series are of particular interest?

It's important to bear in mind that these are preliminary results, as with everything in an open source project, and should be taken as work in progress. Iain understands this and wants to make sure everyone else does. Iain also points out that similar approaches have been used successfully to identify novel targets of FDA compounds (see here and here), and the Shoichet lab have a nice webserver that can be used interactively.

The other way of doing target prediction is experimental. Iain mentioned a couple of guys that might be perfect for this - Corey Nislow who runs a yeast-based assay for target ID, and Andrew Emili who is developing a proteomic-based assay. They're both at the University of Toronto, along with Gary Bader, whom Iain also suggested we contact. I'll reach out to see if they're interested. Any advice on the best approach gratefully received - chances of success here? Favoured method for target ID?

Bioisosteric transformation maps


Request for Help

It is common in drug discovery to have a highly potent hit that has to be optimised to remove undesirable characteristics such as poor oral bioavailability, poor metabolic stability or toxicity. In our case, we have a number of highly potent compounds that have quite a high LogP, which is considered a warning sign for both promiscuity (i.e. binding to many targets in vitro) and poor oral bioavailability (a LogP above 5 breaks one of the criteria of the Lipinski Rule of 5).
One approach to solving this issue is the concept of bioisosterism. From Wikipedia, "bioisosteres are substituents or groups with similar physical or chemical properties which produce broadly similar biological properties to a chemical compound. In drug design, the purpose of exchanging one bioisostere for another is to enhance the desired biological or physical properties of a compound without making significant changes in chemical structure."
One such example would be replacing a hydrogen with a fluorine at a site of metabolic oxidation. "Because the fluorine atom is similar in size to the hydrogen atom the overall topology of the molecule is not significantly affected, leaving the desired biological activity unaffected. However, with a blocked pathway for metabolism, the drug candidate may have a longer half-life."
To that end I have created bioisosteric transformations using the Pipeline Pilot software programme for the compounds synthesized as part of this project. Two different approaches are implemented to generate the transformations:
1) Classic Bioisosteres involve transforming the original molecule based on a set of ~200 commonly used transformations, such as replacing a hydroxyl with a sulfonamide.
2) Database Bioisosteres involve transformations based on an algorithm described in this paper:
"M. Wagener, J.P.M. Lommerse, “The Quest for Bioisosteric Replacements”, J. Chem. Inf. Model., 2006, 46(2), 677-685"
Focusing on just ZYH-3-1, I generated two reports (one for each method) showing about 20 compounds resulting from such transformations that would all have ALogP < 5. It would be interesting to know how easy they would be to synthesize, as well as what would make sense to make based on what we know about the SAR of these compounds.
All the reports and data are available
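
As a rough illustration of the ALogP < 5 cut-off and the Rule of 5 warning mentioned above, here is a small sketch of how a list of proposed analogues could be filtered. This is not the Pipeline Pilot protocol: the `Candidate` class, the function names and every property value below are made up for illustration, and real values would come from a cheminformatics toolkit.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    mol_weight: float   # molecular weight in Da
    logp: float         # calculated logP (e.g. ALogP)
    h_donors: int       # hydrogen bond donors
    h_acceptors: int    # hydrogen bond acceptors

def lipinski_violations(c):
    """Count how many of the four Rule of 5 criteria a compound breaks."""
    return sum([
        c.mol_weight > 500,
        c.logp > 5,
        c.h_donors > 5,
        c.h_acceptors > 10,
    ])

def keep_low_logp(candidates, max_logp=5.0):
    """Keep only the candidates whose calculated logP is below the cut-off."""
    return [c for c in candidates if c.logp < max_logp]

# Hypothetical values only; these are NOT measured or calculated data
# for any compound in the project.
candidates = [
    Candidate("parent", 480.0, 5.6, 1, 4),
    Candidate("analogue_1", 470.0, 4.2, 2, 5),
    Candidate("analogue_2", 510.0, 4.8, 1, 6),
]

kept = keep_low_logp(candidates)
violations = {c.name: lipinski_violations(c) for c in candidates}

for c in candidates:
    status = "keep" if c in kept else "drop"
    print(f"{c.name}: logP {c.logp}, {violations[c.name]} violation(s), {status}")
```

Note that passing the logP cut-off does not mean a compound is free of violations: in this toy set, analogue_2 survives the filter but still breaks the molecular weight criterion.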

Compound Similarity Networks from Iain



More fantastic work from Iain Wallace of ChEMBL, below. These are maps of active antimalarials and predicted targets, expressed as similarity maps, i.e. with an extra level of analysis added on top. This provides a very intuitive way of walking through related compounds to compare structures. How best do we use this kind of analysis - as a target guide, or as a "prediction of what to make next" guide?

Iain says:
I have now predicted targets for all the anti-malarial active compounds in ChEMBL-NTD (~20,000).  I have a full report for all these compounds, but it is quite large (~90 MB and 1200 pages) so I have displayed the results as a compound similarity network (posted here). In this network, compounds are represented as nodes and very similar compounds are connected by edges. Nodes are coloured by their predicted top scoring target. The names of the compounds can be viewed by opening the file in Illustrator and zooming in very far.
A similar map for compounds similar to zyh-72 [and one of the starting "Near Neighbour" set TCMDC 123563] in this dataset is posted here.
Also [on the two pages linked above] are two original networks that were generated using Cytoscape. If you install the Cytoscape plugin "ChemViz", you can right click a node and view the compound structure. You can also view the target of the compound.
I think networks are a useful way of visualizing/integrating different types of information, and I would be interested to hear if you had any thoughts on how to make this type of visualization more useful. For example, the similarity measure I am using may not be finding molecules that you would expect to find.
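
For anyone wondering how such a network is put together, here is a minimal sketch: compounds are nodes, and an edge is drawn whenever two fingerprints are similar enough. The Tanimoto coefficient, the 0.7 threshold, the toy fingerprints and the compound and target names are all assumptions for illustration. Iain's actual similarity measure and cut-off are not stated here.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similarity_network(fingerprints, threshold=0.7):
    """Return edges (name1, name2, similarity) for every pair at or above the threshold."""
    edges = []
    for n1, n2 in combinations(sorted(fingerprints), 2):
        s = tanimoto(fingerprints[n1], fingerprints[n2])
        if s >= threshold:
            edges.append((n1, n2, round(s, 2)))
    return edges

# Hypothetical toy fingerprints and predicted top-scoring targets.
fingerprints = {
    "cpd_A": {1, 2, 3, 4},
    "cpd_B": {1, 2, 3, 4, 5},
    "cpd_C": {1, 2, 3},
    "cpd_D": {7, 8, 9},
}
predicted_target = {"cpd_A": "target_X", "cpd_B": "target_X",
                    "cpd_C": "target_Y", "cpd_D": "target_Y"}

edges = similarity_network(fingerprints)
for n1, n2, s in edges:
    # The predicted target plays the role of the node colour in the map.
    print(f"{n1} ({predicted_target[n1]}) -- {n2} ({predicted_target[n2]}): {s}")
```

The threshold matters a great deal: set it too high and related compounds fall apart into isolated nodes, too low and everything is connected to everything. That is one concrete way the similarity measure might "not be finding molecules that you would expect to find".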

Purchasable compound similarity maps


Request for Help

There are over 5 million compounds that are available to purchase according to the meta service, E-molecules.
It is worth exploring these in the context of the OSDD project as it will identify compound series that are very easy to explore by purchasing analogues (i.e. SAR by catalogue), as well as identifying compounds that are potentially more synthetically accessible than others (i.e. if there are many close neighbours, these compounds might be easier to make than others).
To this end I have generated three chemical similarity maps, showing purchasable compounds that are very similar to known anti-malarials.

Maps generated

1) Centered on ~40 compounds synthesized by OSDD:
This is particularly interesting, as there are many compounds similar to ZYH 3-1 that could be purchased to explore both ends of the molecule.
2) Centered on ~550 compounds from GSK prioritized into ~40 compound series
This is interesting as it clearly identifies high priority series that are easier to follow up on than others, based on the premise that if there are many close analogs available, it should be cheaper/easier to follow up on these clusters than on clusters with no close analogs available. For example, the paper highlights five clusters for follow-on work. #31 has the most similar available compounds (~100 compounds), while #18 has the fewest (1), which would suggest to me that it is much more efficient to follow up on the compounds in cluster #31 than any of the other high priority clusters. There are other very interesting clusters too, which perhaps have more interesting chemistry.
3) Centered on ~400 compounds in the Malaria open box
Likewise, certain compounds in the Malaria open box have many more close neighbours than others, which could help prioritize compounds for the community.
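
The "many close analogs means easier follow-up" premise above can be sketched as a simple count: for each cluster, how many catalogue compounds are close to at least one cluster member? Everything in this example is hypothetical (cluster names, vendor IDs, fingerprints, and the 0.7 Tanimoto threshold). It is meant only to show the shape of the calculation, not the procedure actually used to build the maps.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def count_purchasable_neighbours(clusters, catalogue, threshold=0.7):
    """For each cluster, count catalogue compounds similar to at least one member."""
    counts = {}
    for cluster_id, members in clusters.items():
        counts[cluster_id] = sum(
            any(tanimoto(fp, member) >= threshold for member in members)
            for fp in catalogue.values()
        )
    return counts

def rank_clusters(counts):
    """Order clusters from most to fewest purchasable neighbours (ties by name)."""
    return sorted(counts, key=lambda c: (-counts[c], c))

# Hypothetical toy data: two clusters and a four-compound vendor catalogue.
clusters = {
    "cluster_1": [{1, 2, 3, 4}],
    "cluster_2": [{10, 11, 12, 13}],
}
catalogue = {
    "vendor_1": {1, 2, 3, 4, 5},                  # 0.80 to cluster_1
    "vendor_2": {1, 2, 3},                        # 0.75 to cluster_1
    "vendor_3": {10, 11, 12, 13, 14, 15, 16},     # about 0.57 to cluster_2
    "vendor_4": {20, 21},                         # unrelated
}

counts = count_purchasable_neighbours(clusters, catalogue)
ranking = rank_clusters(counts)
print(counts, ranking)
```

A count like this says nothing about how interesting the chemistry is, only how cheaply a cluster could be explored by catalogue.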

It would be great to get feedback on this approach, namely:

1) Does this type of visualization work for chemists? Is it straightforward to download Cytoscape, install ChemViz and load the network?
2) What do we know about the SAR of these compounds? This would help to prioritize/focus our search of chemical space. While I used ECFP_4 fingerprints, other similarity measures can prioritize other features differently (e.g. if we know a particular substructure is key, then all compounds should contain it).
3) While I think the network view is great for a global overview of the compounds available (and can be overlaid with any other types of data that we can think of, such as predicted targets), perhaps there is a better visualization for a smaller number of compounds?

Proposed resynthesis strategy for TCMDC-123812 and TCMDC-123794

Below is a proposed synthesis strategy for the two members of the TCMDC aryl pyrrole series. Please comment with your thoughts and improvements.
Experimental attempts are documented in an electronic lab notebook found here:
Proposed synthesis strategy