nationality->etymology / describe potential future research directions #29
content/01.abstract.md
Outdated
We used multiple methods to estimate the race, ethnicity, gender, and nationality of authors and the recipients of these honors.
To address weaknesses in existing approaches, we built a new dataset of more than 700,000 people-nationality pairs from Wikipedia and trained long short-term memory neural networks to make predictions.
We used multiple methods to estimate the race, ethnicity, gender, and name etymology of authors and the recipients of these honors.
To address weaknesses in existing approaches, we built a new dataset of more than 700,000 people name etymology pairs from Wikipedia and trained long short-term memory neural networks to make predictions.
I think we should keep this instance as nationality. We're using the Wiki data to infer name etymology, but the raw data that we built is still name-country pairs, whether country of birth or of self-identification.
> I think we should keep this instance as nationality.
First you need to define "nationality". It is not citizenship, it is not religion, it is not ethnicity... what is it?
Wikipedia uses this definition: https://en.wikipedia.org/wiki/Wikipedia:Citizenship_and_nationality
@idoerg I agree that we don't have a concrete definition of "nationality" here. Nationality from Wikipedia's card for a living person is essentially what Wikipedia curators consider to be the person's primary country. With this loose definition, there are definitely cases where the prediction for a name origin disagrees with "nationality". However, we tried to account for this issue by using prediction probabilities rather than hard assignments of origins for each name.
I think the confusion mostly comes from the NamePrism groupings, or perhaps more from the names of these groupings. I am not aware of any consensus on how to group countries, but we do need some groupings for our algorithm to make meaningful predictions. We would love to have your insights on how to improve these groupings or rename them to reflect accurately what we're trying to do here.
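To illustrate the point above about prediction probabilities versus hard assignments, here is a minimal sketch. It is not the manuscript's code: the function names, group labels, and probability values are all hypothetical and chosen only to show how the two aggregation strategies can diverge.

```python
from collections import defaultdict


def soft_proportions(predictions):
    """Average the per-name probability vectors (soft assignment)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    totals = defaultdict(float)
    for probs in predictions:
        for group, p in probs.items():
            totals[group] += p
    return {group: total / len(predictions) for group, total in totals.items()}


def hard_proportions(predictions):
    """Count only the most probable group for each name (hard assignment)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    totals = defaultdict(float)
    for probs in predictions:
        top_group = max(probs, key=probs.get)
        totals[top_group] += 1.0
    return {group: total / len(predictions) for group, total in totals.items()}


# Hypothetical classifier output for three names over two made-up groups.
predictions = [
    {"group_a": 0.60, "group_b": 0.40},
    {"group_a": 0.55, "group_b": 0.45},
    {"group_a": 0.10, "group_b": 0.90},
]

soft = soft_proportions(predictions)
hard = hard_proportions(predictions)
```

With these made-up numbers, hard assignment credits `group_a` with two of the three names, while the soft estimate gives it well under half, because the two `group_a` calls were close ones. That gap is the kind of uncertainty that hard labels discard.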
Sorry - link should have gone here: https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Biography#Context
content/02.introduction.md
Outdated
Existing methods were relatively US-centric because most of the data was derived in whole or in part from the US Census.
We scraped more than 700,000 entries from English-language Wikipedia that contained nationality information to complement these existing methods and built multiple machine learning classifiers to predict nationality.
We scraped more than 700,000 entries from English-language Wikipedia that contained name etymology information to complement these existing methods and built multiple machine learning classifiers to predict name etymology.
Same. It's misleading to say that the Wikipedia entries contained name etymology information. I'd stick with nationality in the first part of the sentence, and then say name etymology in the second.
With that in mind, I think I honestly prefer "name origins" over "name etymology", since the Wiki data will be loosely based on country of origin but not really on linguistics.
We were able to define a name and name etymology for 708,493 people by using the union of these strategies.
Our Wikipedia-based process returned a name etymology or country of origin, which was more fine-grained than the broader regional patterns that we sought to examine among honorees and authors.
This structure comes from editor [guidance on biography articles](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Biography#Context) and is designed to capture:
> ... the country of which the person is a citizen, national or permanent resident, or if the person is notable mainly for past events, the country where the person was a citizen, national or permanent resident when the person became notable.
This is helpful to have here!
We can go with "name etymology" for now, but I'm leaning toward "name origins" after re-reading the complete manuscript.
Co-Authored-By: Trang Le <[email protected]>
@trang1618: made a few changes throughout to support the etymology -> origin change. It probably still needs more tuning but I think it's an improvement over what we have deployed. If the build passes, I will merge. I'll leave #27 open so we can continue discussion upon your return.
[ci skip] This build is based on e642781. This commit was created by the following CI build and job: https://github.com/greenelab/iscb-diversity-manuscript/commit/e64278142b4fdd02cdbed5a8b6823cf58ecd9ebc/checks https://github.com/greenelab/iscb-diversity-manuscript/runs/run3
In #27 @idoerg helpfully noted that "nationality" was an imprecise concept for what we are able to measure. This converts to the term "name etymology" and also adds a more nuanced discussion of how name etymology and affiliations might be combined for a study into potential factors underlying discrepancies.