Zero-width spaces in SMILES cause unexpected behaviour. #1807

Yoshanuikabundi · 2024-01-10T06:17:00Z

Is your feature request related to a problem? Please describe.
A zero-width space (\u200b) in a SMILES string causes RDKit to truncate the molecule without an error:

>>> from openff.toolkit import Molecule, RDKitToolkitWrapper
>>> 
>>> #      ZWSs here ‾‾‾‾\/‾‾‾‾‾‾‾‾‾‾\/‾‾‾‾‾‾‾‾‾‾‾‾\/‾‾\/‾‾\/‾‾\/
>>> smiles = "COc1ccc(cc1)C(=O)/C(C#N)=N/Nc2ccc(cc2)[N+](=O)[O-]"
>>> 
>>> Molecule.from_smiles(
...     smiles, 
...     toolkit_registry=RDKitToolkitWrapper()
... ).n_atoms
16
>>> Molecule.from_smiles(
...     smiles.replace('\u200b', ''), 
...     toolkit_registry=RDKitToolkitWrapper()
... ).n_atoms
36

I ran into this very confusing behavior while copying SMILES strings from specs.net.

Describe the solution you'd like

Ideally, RDKit would raise a parse error rather than just truncate, but we could solve this issue on our end instead.

We could strip zero-width spaces (and other invisible, SMILES-irrelevant unicode symbols) from SMILES strings before passing them on to RDKit, but this may not solve similar problems with codepoints that look like ASCII symbols but aren't.
We could normalize all unicode to the closest ascii-compatible character, dropping anything that is not ascii:

>>> import unicodedata
>>>
>>> normalized = unicodedata.normalize('NFKD', smiles).encode('ascii','ignore').decode()
>>> normalized
'COc1ccc(cc1)C(=O)/C(C#N)=N/Nc2ccc(cc2)[N+](=O)[O-]'
>>> Molecule.from_smiles(
...     normalized, 
...     toolkit_registry=RDKitToolkitWrapper()
... ).n_atoms
36

We could also just raise an error if any non-ASCII characters are detected. Are non-ASCII characters ever used in SMILES?

The text was updated successfully, but these errors were encountered:

j-wags · 2024-01-10T17:27:38Z

If we choose to do anything on our end, I'm kinda in favor of 3. But since this has taken 5 years to emerge I think the best course would be to raise this on the RDKit issue tracker and see if @greglandrum invites us to submit a PR to fix upstream.

greglandrum · 2024-01-10T20:18:58Z

I'd be happy to see a PR or bug report with examples to raise an error on stuff like this on the RDKit side.

Given that the parsing code is working purely with 8 bit chars and ASCII, there's no other sensible alternative

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Zero-width spaces in SMILES cause unexpected behaviour. #1807

Zero-width spaces in SMILES cause unexpected behaviour. #1807

Yoshanuikabundi commented Jan 10, 2024

j-wags commented Jan 10, 2024

greglandrum commented Jan 10, 2024 •

edited by mattwthompson

Loading

Zero-width spaces in SMILES cause unexpected behaviour. #1807

Zero-width spaces in SMILES cause unexpected behaviour. #1807

Comments

Yoshanuikabundi commented Jan 10, 2024

j-wags commented Jan 10, 2024

greglandrum commented Jan 10, 2024 • edited by mattwthompson Loading

greglandrum commented Jan 10, 2024 •

edited by mattwthompson

Loading