Support standard URL spec #91
Comments
Note that we should have a different spec from the RFC, because we need to recognize URLs in natural language text, where we can't assume that words are separated by white space. The original author really took care of that.
Oh, you mean a natural language URL might end at a natural contextual boundary rather than at a space? URL identification has been a painfully recurring problem, and there must be a more standard way to implement this, maybe by tweaking the ending delimiter to support natural languages.
We have two points on this.

We treat natural language representations of URLs, not strict URLs in RFC3986

We want to recognize natural language URLs like the one below.

http://de.wikipedia.org/wiki/März

It doesn't conform to the RFC, but you can see it in the address bar of web browsers like Safari, Chrome and Firefox. As you know, it's actually encoded into a strict URL internally.

http://de.wikipedia.org/wiki/M%C3%A4rz

But that's not readable. That's why these browsers show the representation form instead of the strict URL in their address bars. To make it natural for users, we need to treat natural language representations of URLs instead of strict URLs that conform to the RFC. That's the most important point here. It means we need to define our own spec for acceptable natural language URLs. As the example above shows, it must be different from RFC3986, and it depends on our use cases.

Recognize URLs in natural language text

It's difficult to recognize URLs in natural language text. Consider text in which http://de.wikipedia.org/ is immediately followed by a closing parenthesis. I think most users would expect http://de.wikipedia.org/ to be extracted, instead of http://de.wikipedia.org/). But RFC3986 allows ')' in a path. So if you implement the recognizer by strictly conforming to the RFC, it will extract http://de.wikipedia.org/). I think most people don't like that behavior.

But what about this?

http://en.wikipedia.org/wiki/The_Four_Seasons_(Vivaldi)

I think most users expect http://en.wikipedia.org/wiki/The_Four_Seasons_(Vivaldi). The current twitter-text recognizes this correctly.

As you know, some languages like Japanese, Chinese and Thai don't use spaces as word delimiters. In Japanese, many people write a URL with no space between it and the surrounding text. In that case, most people expect http://www.google.com/ to be extracted.

As these examples show, it's clearly not easy. There is a trade-off between natural behavior and false positives. The original author understood the problem and tuned the recognizer carefully, so that we can recognize URLs in a way that feels natural to many people in various languages.
It's interesting that the comment system on GitHub does URL auto-linkification. It should be similar to ours. They just have a different design for the last example.
http://www.w3.org/Addressing/URL/url-spec.txt