Support standard URL spec #91

jakl · 2013-06-04T20:24:42Z

http://www.w3.org/Addressing/URL/url-spec.txt

psychs · 2013-06-04T20:53:12Z

Note that our spec has to differ from the RFC, because we need to recognize URLs in natural-language text, where we can't assume words are separated by whitespace. The original author took real care over that.

jakl · 2013-06-04T21:47:27Z

Oh, you mean a URL in natural-language text might end at a natural contextual boundary rather than at a space?

URL identification is a painfully recurring problem, and there must be a more standard way to implement it, maybe by tweaking the ending delimiter to support natural languages.

psychs · 2013-06-05T02:53:50Z

We have two points on this.

We treat natural language representations of URLs, not strict URLs in RFC3986

We want to recognize natural language URLs like the one below.

http://de.wikipedia.org/wiki/März

It doesn't conform to the RFC. But you can see it in the address bar of web browsers like Safari, Chrome and Firefox.

As you know, it's actually encoded into a strict URL internally.

http://de.wikipedia.org/wiki/M%C3%A4rz

But it's not readable. That's why these browsers show a readable representation instead of the strict URL in their address bars.

To make it natural for users, we need to treat natural language representations of URLs, instead of strict URLs that conform to the RFC. That's the most important point here.

It means we need to define our own spec for acceptable natural language URLs. As you can see in the above example, it must be different from RFC3986. It depends on our use cases.
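The relation between the two forms above can be sketched in a few lines of Python. This is only an illustration of the encoding step, not anything twitter-text does; the helper name `to_strict` is hypothetical, and it percent-encodes only the path, assuming the host is already ASCII.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def to_strict(display_url):
    """Percent-encode the path of a display-form URL (hypothetical helper)."""
    parts = urlsplit(display_url)
    # Non-ASCII characters are encoded as UTF-8 bytes; "/" and existing "%" are kept.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

strict = to_strict("http://de.wikipedia.org/wiki/März")
readable = unquote(strict)  # back to the form a browser would display
```

Here `strict` is the `M%C3%A4rz` form and `readable` is the `März` form again, so a recognizer can accept the readable form in text and still hand a strict URL to whatever consumes it.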

Recognize URLs in natural language text

It's difficult to recognize URLs in natural language text.

What do you think about this?

(http://de.wikipedia.org/)

I think most users would expect http://de.wikipedia.org/ to be extracted, instead of http://de.wikipedia.org/). But RFC3986 allows ')' in the path. So if you implement the recognizer by strictly conforming to the RFC, it will extract http://de.wikipedia.org/). I think most people wouldn't like that behavior.

But what about this?

http://en.wikipedia.org/wiki/The_Four_Seasons_(Vivaldi)

I think most users expect http://en.wikipedia.org/wiki/The_Four_Seasons_(Vivaldi). The current twitter-text recognizes this correctly.

As you know, some languages such as Japanese, Chinese and Thai don't use spaces as word delimiters. In Japanese, many people write something like the following.
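One way to get both parenthesis cases right is to keep a trailing ')' only when it has a matching '(' inside the URL. The following is a minimal sketch of that heuristic, not the way twitter-text actually does it; the `CANDIDATE` pattern and both function names are assumptions made up for this example, and the pattern is deliberately loose.

```python
import re

# Hypothetical, deliberately loose candidate pattern:
# a scheme followed by any run of non-whitespace characters.
CANDIDATE = re.compile(r"https?://\S+")

def trim_unbalanced_parens(url):
    """Drop trailing ')' characters that have no matching '(' in the URL."""
    while url.endswith(")") and url.count(")") > url.count("("):
        url = url[:-1]
    return url

def extract(text):
    """Return candidate URLs from text with unbalanced closing parens removed."""
    return [trim_unbalanced_parens(m.group()) for m in CANDIDATE.finditer(text)]

wrapped = extract("(http://de.wikipedia.org/)")
balanced = extract("http://en.wikipedia.org/wiki/The_Four_Seasons_(Vivaldi)")
```

With this rule the first input yields http://de.wikipedia.org/ and the second keeps its closing parenthesis.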

http://www.google.com/だよね?

In this case, most people expect http://www.google.com/.
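A rough sketch of how a recognizer could stop before the Japanese text while still accepting a path like /wiki/März: allow ASCII path characters plus accented Latin letters, and nothing else. The character ranges below are an assumption chosen for illustration, not the ranges twitter-text uses, and the host pattern is simplified to ASCII.

```python
import re

# Hypothetical path alphabet: RFC 3986 path characters plus
# Latin letters with diacritics (U+00C0 to U+024F). CJK is excluded.
PATH_CHARS = r"A-Za-z0-9\-._~!$&'()*+,;=:@%/\u00C0-\u024F"

URL = re.compile(r"https?://[A-Za-z0-9.\-]+(?:/[" + PATH_CHARS + r"]*)?")

def first_url(text):
    """Return the first URL-like match in text, or None."""
    m = URL.search(text)
    return m.group() if m else None

japanese = first_url("http://www.google.com/だよね?")
german = first_url("http://de.wikipedia.org/wiki/März")
```

The match for the Japanese sentence ends at the slash, while the German path is kept whole. The cost is that a real path containing Japanese characters would be cut short too.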

As you can see in the above examples, it's clearly not easy. There is a trade-off between natural behavior and false positives.

The original author understood the problem and tuned the recognizer carefully, so that we can recognize URLs in a way that feels natural to many people in various languages.

psychs · 2013-06-05T02:58:34Z

It's interesting that this comment system on GitHub does automatic URL linkification. It should be similar to ours. They just have a different design for the last example.

yaauie mentioned this issue Dec 19, 2013

Broken Conformance: URLs with unicode chars in them #104

Open
