nationality->etymology / describe potential future research directions #29

Merged (7 commits) on Feb 1, 2020


cgreene
Member

@cgreene cgreene commented Jan 31, 2020

In #27 @idoerg helpfully noted that "nationality" was an imprecise concept for what we are able to measure. This converts to the term "name etymology" and also adds a more nuanced discussion of how name etymology and affiliations might be combined for a study into potential factors underlying discrepancies.

-We used multiple methods to estimate the race, ethnicity, gender, and nationality of authors and the recipients of these honors.
-To address weaknesses in existing approaches, we built a new dataset of more than 700,000 people-nationality pairs from Wikipedia and trained long short-term memory neural networks to make predictions.
+We used multiple methods to estimate the race, ethnicity, gender, and name etymology of authors and the recipients of these honors.
+To address weaknesses in existing approaches, we built a new dataset of more than 700,000 people name etymology pairs from Wikipedia and trained long short-term memory neural networks to make predictions.
Collaborator

I think we should keep this instance as nationality. We're using the Wiki data to infer name etymology, but the raw data that we built is still name-country pairs, whether country of birth or of self-identification.

@idoerg idoerg Jan 31, 2020

> I think we should keep this instance as nationality.

First you need to define "nationality". It is not citizenship, it is not religion, it is not ethnicity... what is it?

Collaborator

@idoerg I agree that we don't have a concrete definition of "nationality" here. Nationality from Wikipedia's card for a living person is essentially what Wikipedia curators consider to be the person's primary country. With this loose definition, there are definitely cases where the prediction for a name origin disagrees with "nationality". However, we tried to account for this issue by using prediction probabilities rather than hard assignments of origins for each name.

I think the confusion mostly comes from the NamePrism groupings, or perhaps more from the names of these groupings. I am not aware of any consensus on how to group countries, but we do need some groupings for our algorithm to make meaningful predictions. We would love to have your insights on how to improve these groupings or rename them so that they accurately reflect what we're trying to do here.
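
To illustrate the distinction drawn above between prediction probabilities and hard assignments, here is a minimal sketch. It is not the project's code: the names, group labels (`GroupX`, `GroupY`, `GroupZ`), probabilities, and function names are all hypothetical and chosen only to show how the two aggregation styles can diverge.

```python
from collections import defaultdict

# Hypothetical classifier output: one probability distribution per name.
# The labels and numbers are made up for illustration only.
predictions = {
    "Name A": {"GroupX": 0.55, "GroupY": 0.40, "GroupZ": 0.05},
    "Name B": {"GroupX": 0.10, "GroupY": 0.85, "GroupZ": 0.05},
    "Name C": {"GroupX": 0.45, "GroupY": 0.50, "GroupZ": 0.05},
}


def hard_proportions(preds):
    """Assign each name to its single most probable group, then count."""
    counts = defaultdict(float)
    for probs in preds.values():
        top_group = max(probs, key=probs.get)
        counts[top_group] += 1.0
    n = len(preds)
    return {group: count / n for group, count in counts.items()}


def soft_proportions(preds):
    """Sum the probabilities per group so every name contributes fractionally."""
    totals = defaultdict(float)
    for probs in preds.values():
        for group, p in probs.items():
            totals[group] += p
    n = len(preds)
    return {group: total / n for group, total in totals.items()}


if __name__ == "__main__":
    print("hard:", hard_proportions(predictions))
    print("soft:", soft_proportions(predictions))
```

With hard assignment, a name whose top probability is only 0.55 counts fully toward one group, and a group that never ranks first disappears from the tally. With probability-weighted aggregation, the uncertainty for that name is carried into the group-level estimate.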

 Existing methods were relatively US-centric because most of the data was derived in whole or in part from the US Census.
-We scraped more than 700,000 entries from English-language Wikipedia that contained nationality information to complement these existing methods and built multiple machine learning classifiers to predict nationality.
+We scraped more than 700,000 entries from English-language Wikipedia that contained name etymology information to complement these existing methods and built multiple machine learning classifiers to predict name etymology.
Collaborator

Same. It's misleading to say that the Wikipedia entries contained name etymology information. I'd stick with nationality in the first part of the sentence, and then say name etymology in the second.

Collaborator

With that in mind, I think I honestly prefer "name origins" over "name etymology", since the Wiki data will be loosely based on country of origin but not really on linguistics.

We were able to define a name and name etymology for 708,493 people by using the union of these strategies.
Our Wikipedia-based process returned a name etymology or country of origin, which was more fine-grained than the broader regional patterns that we sought to examine among honorees and authors.
This structure comes from editor [guidance on biography articles](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Biography#Context) and is designed to capture:
> ... the country of which the person is a citizen, national or permanent resident, or if the person is notable mainly for past events, the country where the person was a citizen, national or permanent resident when the person became notable.
Contributor

This is helpful to have here!

Collaborator

@trangdata trangdata left a comment

We can go with "name etymology" for now, but I'm leaning toward "name origins" after re-reading the complete manuscript.

content/20.results.md (outdated)
Member Author

cgreene commented Feb 1, 2020

@trang1618: I made a few changes throughout to support the etymology -> origin change. It probably still needs more tuning, but I think it's an improvement over what we have deployed. If the build passes, I will merge. I'll leave #27 open so we can continue the discussion upon your return.

5 participants