Is your feature request related to a problem? Please describe.
A zero-width space (`\u200b`) in a SMILES string causes RDKit to truncate the molecule without raising an error.
I ran into this very confusing behavior while copying SMILES strings from specs.net.
Describe the solution you'd like
Ideally, RDKit would raise a parse error rather than just truncate, but we could solve this issue on our end instead.
We could strip zero-width spaces (and other invisible, SMILES-irrelevant Unicode characters) from SMILES strings before passing them on to RDKit, but this may not solve similar problems with code points that look like ASCII symbols but aren't.
We could normalize all Unicode to the closest ASCII-compatible character, dropping anything that is not ASCII:
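The two client-side options above could look roughly like this. It is only a sketch using Python's standard `unicodedata` module; the function names `strip_invisible` and `to_ascii` are hypothetical, and treating Unicode category `Cf` as "invisible" is an assumption about which characters are safe to drop.

```python
import unicodedata


def strip_invisible(smiles: str) -> str:
    """Drop Unicode format characters (category Cf), such as U+200B.

    Assumption: nothing in category Cf is meaningful in a SMILES string.
    """
    return "".join(ch for ch in smiles if unicodedata.category(ch) != "Cf")


def to_ascii(smiles: str) -> str:
    """Apply NFKD normalization, then drop anything that is still not ASCII."""
    decomposed = unicodedata.normalize("NFKD", smiles)
    return decomposed.encode("ascii", "ignore").decode("ascii")


dirty = "CC\u200bO"               # ethanol with a zero-width space inside
fullwidth = "\uff23\uff23\uff2f"  # fullwidth look-alikes of "CCO"

print(strip_invisible(dirty))     # zero-width space removed
print(to_ascii(fullwidth))        # look-alikes mapped to ASCII
# The cleaned string would then be handed to the SMILES parser.
```

Note that `strip_invisible` leaves the fullwidth look-alikes untouched, which is the limitation of the stripping approach mentioned above. Both functions discard characters silently, so either one could itself hide a corrupted input.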
If we choose to do anything on our end, I'm kinda in favor of 3. But since this has taken 5 years to emerge I think the best course would be to raise this on the RDKit issue tracker and see if @greglandrum invites us to submit a PR to fix upstream.