Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zero-width spaces in SMILES cause unexpected behaviour. #1807

Open
Yoshanuikabundi opened this issue Jan 10, 2024 · 2 comments
Open

Zero-width spaces in SMILES cause unexpected behaviour. #1807

Yoshanuikabundi opened this issue Jan 10, 2024 · 2 comments

Comments

@Yoshanuikabundi
Copy link
Collaborator

Is your feature request related to a problem? Please describe.
A zero-width space (\u200b) in a SMILES string causes RDKit to truncate the molecule without an error:

>>> from openff.toolkit import Molecule, RDKitToolkitWrapper
>>> 
>>> #      ZWSs here ‾‾‾‾\/‾‾‾‾‾‾‾‾‾‾\/‾‾‾‾‾‾‾‾‾‾‾‾\/‾‾\/‾‾\/‾‾\/
>>> smiles = "COc1ccc(cc1)​C(=O)/C(C#N)​=N/Nc2ccc(cc2)​[N+]​(=O)​[O-]​"
>>> 
>>> Molecule.from_smiles(
...     smiles, 
...     toolkit_registry=RDKitToolkitWrapper()
... ).n_atoms
16
>>> Molecule.from_smiles(
...     smiles.replace('\u200b', ''), 
...     toolkit_registry=RDKitToolkitWrapper()
... ).n_atoms
36

I ran into this very confusing behavior while copying SMILES strings from specs.net.

Describe the solution you'd like

Ideally, RDKit would raise a parse error rather than just truncate, but we could solve this issue on our end instead.

  1. We could strip zero-width spaces (and other invisible, SMILES-irrelevant unicode symbols) from SMILES strings before passing them on to RDKit, but this may not solve similar problems with codepoints that look like ASCII symbols but aren't.
  2. We could normalize all unicode to the closest ascii-compatible character, dropping anything that is not ascii:
>>> import unicodedata
>>>
>>> normalized = unicodedata.normalize('NFKD', smiles).encode('ascii','ignore').decode()
>>> normalized
'COc1ccc(cc1)C(=O)/C(C#N)=N/Nc2ccc(cc2)[N+](=O)[O-]'
>>> Molecule.from_smiles(
...     normalized, 
...     toolkit_registry=RDKitToolkitWrapper()
... ).n_atoms
36
  1. We could also just raise an error if any non-ASCII characters are detected. Are non-ASCII characters ever used in SMILES?
@j-wags
Copy link
Member

j-wags commented Jan 10, 2024

If we choose to do anything on our end, I'm kinda in favor of 3. But since this has taken 5 years to emerge I think the best course would be to raise this on the RDKit issue tracker and see if @greglandrum invites us to submit a PR to fix upstream.

@greglandrum
Copy link

greglandrum commented Jan 10, 2024

I'd be happy to see a PR or bug report with examples to raise an error on stuff like this on the RDKit side.

Given that the parsing code is working purely with 8 bit chars and ASCII, there's no other sensible alternative

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants