This repository has been archived by the owner on Sep 18, 2021. It is now read-only.

Support standard URL spec #91

Open
jakl opened this issue Jun 4, 2013 · 4 comments

Comments

@jakl
Contributor

jakl commented Jun 4, 2013

http://www.w3.org/Addressing/URL/url-spec.txt

@psychs
Contributor

psychs commented Jun 4, 2013

Note that we need a spec that differs from the RFC, because we have to recognize URLs in natural language text, where we can't assume words are separated by whitespace. The original author took great care over that.

@jakl
Contributor Author

jakl commented Jun 4, 2013

Oh you mean a natural language URL might have a natural contextual ending rather than a space?

URL identification has been a painfully recurring problem, and there must be a more standard way to implement it, perhaps by tweaking the ending delimiter to support natural languages.

@psychs
Contributor

psychs commented Jun 5, 2013

We have two points on this.

1. We treat natural language representations of URLs, not strict URLs as defined in RFC3986

We want to recognize natural language URLs like below.

http://de.wikipedia.org/wiki/März

It doesn't conform to the RFC. But you can see it in the address bar of web browsers like Safari, Chrome and Firefox.

As you know, it's actually encoded into a strict URL internally.

http://de.wikipedia.org/wiki/M%C3%A4rz

But it's not readable. That's why these browsers show the representation form instead of the strict URL in their address bars.
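The relationship between the two forms can be sketched with Python's standard `urllib.parse`. This is only an illustration, not how twitter-text or any browser does it; the helper names `to_strict` and `to_display` are hypothetical, and the sketch assumes UTF-8 and touches only the path component.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def to_strict(display_url: str) -> str:
    """Percent-encode non-ASCII characters in the path of a display-form URL."""
    parts = urlsplit(display_url)
    # quote() encodes the path as UTF-8; "/" and "%" are kept as-is so that
    # path separators and already-escaped sequences are not touched.
    strict_path = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=strict_path))


def to_display(strict_url: str) -> str:
    """Decode percent-escapes in the path back into readable characters."""
    parts = urlsplit(strict_url)
    return urlunsplit(parts._replace(path=unquote(parts.path)))


print(to_strict("http://de.wikipedia.org/wiki/März"))
print(to_display("http://de.wikipedia.org/wiki/M%C3%A4rz"))
```

The first call yields the strict form shown above, and the second recovers the readable form.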

To make it natural for users, we need to treat natural language representations of URLs, instead of strict URLs that conform to the RFC. That's the most important point here.

It means we need to define our own spec for acceptable natural language URLs. As you can see in the above example, it must be different from RFC3986. It depends on our use cases.

2. Recognize URLs in natural language text

It's difficult to recognize URLs in natural language text.

What do you think about this?

(http://de.wikipedia.org/)

I think most users would expect http://de.wikipedia.org/ to be extracted, instead of http://de.wikipedia.org/). But RFC3986 allows ')' in a path. So if you implement the recognizer by strictly conforming to the RFC, it will extract http://de.wikipedia.org/). I think most people would not like that behavior.

But what about this?

http://en.wikipedia.org/wiki/The_Four_Seasons_(Vivaldi)

I think most users expect http://en.wikipedia.org/wiki/The_Four_Seasons_(Vivaldi). The current twitter-text recognizes this correctly.
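One way to get both parenthesis cases right is to match loosely and then drop trailing ')' characters that have no matching '(' inside the URL. The following is a minimal sketch of that idea, not twitter-text's actual rules; the `CANDIDATE` pattern and the function names are assumptions made for illustration.

```python
import re
from typing import List

# Deliberately loose candidate pattern (an assumption of this sketch):
# a scheme followed by any run of non-whitespace characters.
CANDIDATE = re.compile(r"https?://\S+")


def trim_unbalanced_parens(url: str) -> str:
    """Drop trailing ')' characters that have no matching '(' in the URL."""
    while url.endswith(")") and url.count(")") > url.count("("):
        url = url[:-1]
    return url


def extract_urls(text: str) -> List[str]:
    """Find loose candidates, then trim unbalanced closing parentheses."""
    return [trim_unbalanced_parens(m.group()) for m in CANDIDATE.finditer(text)]


print(extract_urls("(http://de.wikipedia.org/)"))
print(extract_urls("http://en.wikipedia.org/wiki/The_Four_Seasons_(Vivaldi)"))
```

With this heuristic the wrapping parenthesis in the first text is dropped, while the balanced `(Vivaldi)` in the second stays part of the URL.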

As you probably know, some languages such as Japanese, Chinese and Thai don't use spaces as word delimiters. In Japanese, many people write like below.

http://www.google.com/だよね?

In this case, most people expect http://www.google.com/.
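A rough way to handle text that follows a URL without a space is to end the URL at the first character from a script that rarely appears in URL paths. The sketch below assumes a hand-picked set of Unicode ranges (Thai, Japanese kana, CJK ideographs, fullwidth forms); the ranges and the name `cut_at_script_boundary` are hypothetical, not the character classes twitter-text uses. Latin letters with diacritics, as in März, are left alone.

```python
import re

# Assumed ranges where the URL is taken to end (illustrative only):
# Thai, CJK punctuation plus Hiragana/Katakana, CJK ideographs, fullwidth forms.
NON_URL_SCRIPT = re.compile(
    "[\u0e00-\u0e7f\u3000-\u30ff\u4e00-\u9fff\uff00-\uffef]"
)


def cut_at_script_boundary(candidate: str) -> str:
    """Truncate a URL candidate at the first character from the assumed scripts."""
    m = NON_URL_SCRIPT.search(candidate)
    return candidate[: m.start()] if m else candidate


print(cut_at_script_boundary("http://www.google.com/だよね?"))
print(cut_at_script_boundary("http://de.wikipedia.org/wiki/März"))
```

The trade-off is visible here: a real URL whose path contains Japanese characters would also be cut short by this rule.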

As you can see in the above examples, it's clearly not easy. There is a trade-off between natural behavior and false positives.

The original author understood the problem and tuned the rules carefully, so that we can recognize URLs in a way that feels natural to many people in various languages.

@psychs
Contributor

psychs commented Jun 5, 2013

It's interesting that the comment system here on GitHub auto-links URLs too. It should behave similarly to ours. They just have a different design for the last example.
