Implementation of Turkish Swear Words Database #402

usirin · 2023-05-30T22:49:12Z

usirin
May 30, 2023
Maintainer

I am currently exploring the idea of implementing a "Turkish Swear Words Database" in our project. This addition would serve as a comprehensive resource for applications such as text-based AI moderation, language learning tools, or linguistic research, focusing specifically on the Turkish language. The goal is to amass an extensive list of Turkish swear words, their potential variations, and, if possible, their cultural contexts, to provide a holistic understanding of how they are used in everyday Turkish.

That said, there are several key aspects to consider. The collection and verification of the data could be a sensitive process, requiring careful attention to accuracy and cultural respect. It would be important to ensure that the database is not promoting or endorsing the use of offensive language but providing an educational resource. Furthermore, we would need to discuss how to manage the moderation and maintenance of this database to ensure it remains current and relevant. I invite any thoughts, suggestions, or potential challenges that the team may foresee with this proposed feature.

Goals:

Collect an extensive list of Turkish swear words along with their potential variations.
Understand and document the cultural context of these words where possible.
Create a process for verifying the accuracy of the data collected.
Ensure the database serves as an educational resource and does not promote or endorse the use of offensive language.
Develop a strategy for the moderation and maintenance of the database.
Discuss potential challenges and work towards solutions together as a team.
Ensure the database stays current and relevant over time.

usirin · 2023-05-30T23:08:08Z

usirin
May 30, 2023
Maintainer Author

Possible release scenarios

NPM Package: We could package the database as an npm module that developers can easily install and import into their projects. This would offer great flexibility, as it could be integrated into a variety of applications, like chatbots, language learning apps, or text-moderation systems. The package could expose an API that allows developers to search for swear words, check if a word is considered a swear word, and potentially even provide some context or usage examples.
Web Service: In addition to, or instead of, an npm package, we could also consider creating a RESTful API or GraphQL service. This would make the database accessible to projects beyond the JavaScript ecosystem, widening its potential use. It could have similar features to the npm package, but accessed over HTTP(S).
Website: We could also consider building a dedicated website for the Turkish Swear Words Database. The website could serve as an interactive platform for people to explore the database, learn about the cultural context of the words, and perhaps even contribute to the database. This could also be a great opportunity to provide educational resources about language and culture, and make clear our intention to promote understanding, not the use of offensive language.
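
As a rough illustration of the npm package scenario above, here is a minimal TypeScript sketch of what the exposed API might look like. Everything in it is an assumption made for illustration: the `SwearEntry` shape, the function names `isSwearWord`, `getEntry`, and `searchEntries`, and the placeholder data (harmless stand-in words, not real entries). A real package would export these functions and load a curated data set.

```typescript
// Hypothetical entry shape; the field names are assumptions, not from the discussion.
interface SwearEntry {
  word: string;         // canonical form
  variations: string[]; // known spellings or inflected forms
  context?: string;     // optional cultural or usage note
}

// Placeholder data: harmless stand-in words instead of real entries.
const ENTRIES: SwearEntry[] = [
  { word: "ornek", variations: ["ornekler", "ornegi"], context: "placeholder entry" },
  { word: "deneme", variations: ["denemeler"] },
];

// Index every known form (canonical word and variations) to its entry.
const INDEX = new Map<string, SwearEntry>();
ENTRIES.forEach((entry) => {
  INDEX.set(entry.word, entry);
  entry.variations.forEach((form) => INDEX.set(form, entry));
});

// Trim and lowercase with Turkish casing rules (dotted/dotless i).
function normalizeWord(input: string): string {
  return input.trim().toLocaleLowerCase("tr");
}

// True when the word, after normalization, is a known form.
function isSwearWord(word: string): boolean {
  return INDEX.has(normalizeWord(word));
}

// Returns the full entry (including optional context) for a known form.
function getEntry(word: string): SwearEntry | undefined {
  return INDEX.get(normalizeWord(word));
}

// Returns the distinct entries that have at least one form starting with the prefix.
function searchEntries(prefix: string): SwearEntry[] {
  const p = normalizeWord(prefix);
  const hits = new Set<SwearEntry>();
  INDEX.forEach((entry, form) => {
    if (form.startsWith(p)) hits.add(entry);
  });
  return Array.from(hits);
}
```

The same three operations could back the web service scenario, with each function mapped to an HTTP endpoint.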

I believe these different forms of release could cover a wide range of use cases and make the database as accessible as possible. I would love to hear your thoughts on these suggestions, and if there are other avenues you believe we should explore.


ubsoydan · 2023-05-31T01:03:35Z

ubsoydan
May 31, 2023

This is definitely something that should already have been done and open-sourced in the public interest, considering how many popular commercial platforms and services deal with Turkish text. Some of them probably already have a robust solution but refrain from sharing it publicly for some reason.

Anyway, as for a solution, I suggest taking a slang dictionary as a reference. AFAIK, the most extensive collections of general Turkish slang words and expressions are published under this name.

We may need a collective effort to sort out the irrelevant slang, though. Since most of these slang words derive from subcultures and minority groups, are influenced by various terminologies, and so on, we need to eliminate the non-malicious ones to get clean data to work with. At this point, I should say that we may also need to maintain two different databases: one for hardcore slang such as "pez***nk" and one for softcore slang like "serefs*z".

After an elimination process guided by human judgement (much needed for cultural context as well as for an optimally curated collection of slang to rely on), we will have two different types of slang: single-word entries like "pez***nk" and multi-word entries representing stereotyped expressions like "o***** coc***".
Identifying the second group is crucial if we want to censor the slang rather than ask the user to remove it before validation. Otherwise, we would end up censoring only the main word "o*****" and leaving the second part as it is (... cocugu ...).
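
One way to handle the multi-word case is to try the longest expressions first, so that a whole phrase is masked before its first word can match on its own. The sketch below is only an illustration under stated assumptions: the `censor` function, the `PHRASES` list, and the stand-in words "kotu" and "kotu soz" are hypothetical, and the case-insensitive flag does not cover Turkish dotted/dotless i correctly.

```typescript
// Hypothetical phrase list with harmless stand-ins; real data would come from the curated database.
const PHRASES: string[] = ["kotu", "kotu soz"];

// Escape regular-expression syntax characters so entries are matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace every listed word or phrase with asterisks, preferring the longest match.
function censor(text: string, phrases: string[]): string {
  if (phrases.length === 0) return text;
  // Longest first, so "kotu soz" is tried before "kotu".
  const ordered = phrases.slice().sort((a, b) => b.length - a.length);
  // Allow any run of whitespace between the words of a phrase.
  const alternatives = ordered.map((phrase) =>
    phrase.split(/\s+/).map(escapeRegExp).join("\\s+")
  );
  // \p{L} lookarounds act as word boundaries that also work for non-ASCII letters.
  const pattern = new RegExp(
    "(?<!\\p{L})(?:" + alternatives.join("|") + ")(?!\\p{L})",
    "giu"
  );
  // Mask every non-space character of the match and keep the spacing.
  return text.replace(pattern, (match) => match.replace(/\S/g, "*"));
}
```

Because of the word-boundary check, an inflected form such as "kotuluk" is left untouched by this sketch.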

And of course, there is another problem regarding the practical use of slang in a sentence. Considering how verbs and nouns change form dynamically (inflections), imagine how tricky detection will be. We've got "yapım eki" (derivational suffixes), "çekim eki" (inflectional suffixes), "fiil çekimi" (verb conjugation), "isim halleri" (noun cases), "ünsüz yumuşaması" (consonant softening), "ünsüz sertleşmesi" (consonant hardening), and other linguistic phenomena that I can't think of right now. We could come up with a solution that programmatically appends each of these possible modifications, one by one, to the slang string itself to produce further versions of it, so that the final app is functional enough to handle edge cases.

We would start with the plain form of a slang expression picked from the dictionary mentioned above, say "Aklına sı*" or "O***** cocuk", then add a suffix, for instance "u", even if the result is grammatically incorrect, which gives us "O**** cocuku". Follow the same procedure to build a list of possible inflections programmatically. Apply this pattern over and over for other possible inflections and you get a plain form plus 50-ish different versions of it, including correct and incorrect forms. For instance, "O*** cocugun, ...cocugunun, ...cocuktan, ...cocukdan, ...cocugtan". We also have to take into account deliberate or accidental typos (O** cucigi) and dropped letters (O** cocgu). On top of that, multi-word expressions make it harder because the number of combinations grows exponentially.
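
The suffix-appending idea can be sketched in a few lines of TypeScript. This is an illustration only: `SUFFIXES`, `stemForms`, and `generateVariants` are hypothetical names, the suffix list is a tiny sample rather than a real inventory of Turkish suffixes, and the neutral word "cocuk" (child) stands in for a real entry. Both the softened stem (k to g) and the unsoftened one are kept, so grammatically incorrect forms are generated on purpose.

```typescript
// Hypothetical sample of suffixes; a real list would be curated by Turkish speakers.
const SUFFIXES: string[] = ["", "u", "un", "unun", "tan", "dan"];

// Return the stem plus a softened variant (final k -> g), keeping both on purpose.
function stemForms(stem: string): string[] {
  const forms = [stem];
  if (stem.endsWith("k")) {
    forms.push(stem.slice(0, -1) + "g");
  }
  return forms;
}

// Combine every stem form with every suffix, including incorrect combinations.
function generateVariants(stem: string, suffixes: string[]): string[] {
  const variants = new Set<string>();
  stemForms(stem).forEach((form) => {
    suffixes.forEach((suffix) => variants.add(form + suffix));
  });
  return Array.from(variants);
}

// 2 stem forms x 6 suffixes = 12 variants for "cocuk".
const cocukVariants = generateVariants("cocuk", SUFFIXES);
```

For a two-word expression the variant counts of the individual words multiply, which is why the list grows so quickly.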

OR EVEN BETTER, we start with the plain form of a slang expression, shorten it to its simplest form, like "O** cocu", "Pezeve", "Si*", and create combinations programmatically, which gives us the versions with typos, dropped letters, and full forms. In this case, "Peseve**", "Pezvnk", "Peze***" respectively.
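
Instead of enumerating every typo, a variation on this idea is to compare the shortened stem against the beginning of each token and tolerate a small edit distance. The sketch below swaps in Levenshtein distance for the combination lists described above, so it is an alternative technique, not the method proposed in this comment. `levenshtein` and `matchesStem` are hypothetical names, and the neutral stem "cocu" stands in for a real entry.

```typescript
// Classic Levenshtein edit distance using two rolling rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// True when some prefix of the token is within maxDistance edits of the stem.
function matchesStem(token: string, stem: string, maxDistance: number): boolean {
  const t = token.toLowerCase();
  const minLen = Math.max(1, stem.length - maxDistance);
  const maxLen = Math.min(t.length, stem.length + maxDistance);
  for (let len = minLen; len <= maxLen; len++) {
    if (levenshtein(t.slice(0, len), stem) <= maxDistance) return true;
  }
  return false;
}
```

With a tolerance of 1, "cocuktan" and the dropped-letter form "cocgu" both match the stem "cocu", while the heavier typo "cucigi" only matches at a tolerance of 2. Short stems combined with a high tolerance will produce false positives, so the threshold would need tuning against real data.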

It would add some cost for the sorting algorithms, but it looks like a much cheaper option than training a Turkish slang language model and then processing queries on it.

These are the results of a quick brainstorming session; maybe someone else can come up with something entirely different, or pitch in another idea to improve those above. We could also look at English-centric examples to see how they approached the problem, but it will be quite different, since English relies on separate prepositions (to, from, in, on...) where Turkish uses suffixes. I am looking forward to seeing how this unfolds.


Luuwa · 2023-05-31T04:32:41Z

Luuwa
May 31, 2023
Maintainer

Utilizing a pre-trained model would be a sensible approach since we lack expertise in this particular subject. Transforming one of the pre-trained models available on "https://huggingface.co/" into a service would be a great idea.

