Replies: 3 comments
-
Possible release scenarios
I believe these different forms of release could cover a wide range of use cases and make the database as accessible as possible. I would love to hear your thoughts on these suggestions, and whether there are other avenues you believe we should explore.
-
This definitely should have been done and open-sourced already, given how many popular commercial platforms and services handle Turkish text. Some of them probably have a robust solution but, for whatever reason, do not share it publicly.
As for a solution, I suggest taking a slang dictionary as the reference. AFAIK, the most extensive general-purpose collections of Turkish slang words and expressions are published under this name. We may need a collective effort to filter out the irrelevant entries, though: most of these words come from subcultures and minority groups or are borrowed from various terminologies, so we have to eliminate the non-malicious ones to end up with clean data.
At this point I should say that we may also need to maintain two separate databases: one for hardcore slang such as "pez***nk" and one for softcore slang such as "serefs*z". After an elimination pass based on human judgement (much needed for cultural context and for ending up with a well-curated collection), we will also have two types of entries: single words like "pez***nk" and multi-word, stereotyped expressions like "o***** coc***".
Then there is the problem of how slang is actually used in a sentence. Verbs and nouns change form through inflection, which makes detection tricky. We have "yapım eki" (derivational suffixes), "çekim eki" (inflectional suffixes), "fiil çekimi" (verb conjugation), "isim halleri" (noun cases), "ünsüz yumuşaması" (consonant softening), "ünsüz sertleşmesi" (consonant hardening), and other linguistic phenomena that I can't think of right now. We could programmatically apply each of these modifications, one by one, to the slang string itself to generate its variants, so that the final app is robust enough to handle edge cases.
We would start with the plain form of a slang expression picked from the dictionary mentioned above, say "Aklına sı*" or "O***** cocuk", then add a suffix, for instance "u", even if the result is grammatically incorrect, which gives us "O**** cocuku". Following the same procedure for the other possible inflections gives a plain form plus 50-ish versions of it, both correct and incorrect. For instance, "O*** cocugun, ...cocugunun, ...cocuktan, ...cocukdan, ...cocugtan". We also have to account for typos, deliberate or not (O** cucigi), and for dropped letters (O** cocgu). On top of that, multi-word expressions make it harder because the number of combinations grows exponentially.
OR EVEN BETTER, we start with the plain form, shorten it to its simplest form, like "O** cocu", "Pezeve", "Si*", and create the combinations programmatically from there, which gives us the versions with typos, the versions with dropped letters, and the full forms: in this case "Peseve**", "Pezvnk", "Peze***" respectively. This would add some cost for the sorting algorithms, but it looks much cheaper than training a Turkish slang language model and running queries through it.
These are the results of a quick brainstorm; maybe someone else can come up with something entirely different or pitch in ideas to improve the ones above. We could also look at English-centric projects to see how they approached the problem, but that will be quite different, since English relies mostly on separate prepositions (to, from, in, on...) where Turkish uses suffixes. I am looking forward to seeing how this unfolds.
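To make the variant-generation idea a bit more concrete, here is a minimal Python sketch, not a definitive implementation. It uses the neutral word "cocuk" as the stem. The suffix list, the softening table, and every function name (`normalize`, `inflections`, `deletions`, `build_variants`, `is_flagged`) are assumptions made up for illustration; a real list of suffixes would have to come from a proper grammar reference. It covers suffixing (correct or not), final-consonant softening, and single dropped letters, but not letter substitutions such as "cucigi", which would need something like an edit-distance check.

```python
import re

# Illustrative, deliberately incomplete suffix list (an assumption, not a grammar).
SUFFIXES = ["", "u", "un", "unun", "tan", "dan", "a", "lar"]

# Simplified final-consonant softening (ünsüz yumuşaması), ASCII-folded.
SOFTEN = {"k": "g", "p": "b", "t": "d"}

# Fold Turkish letters to ASCII so "çocuğu" and "cocugu" compare equal.
FOLD = str.maketrans("çğıöşü", "cgiosu")


def normalize(text: str) -> str:
    """Lowercase, fold Turkish letters to ASCII, and drop everything else."""
    # Python lowercases "İ" to "i" plus a combining dot, so map it by hand first.
    text = text.replace("İ", "i").lower().translate(FOLD)
    return re.sub(r"[^a-z]", "", text)


def inflections(stem: str) -> set[str]:
    """Append every suffix to the stem and to its softened form, right or wrong."""
    bases = {stem}
    if stem and stem[-1] in SOFTEN:
        bases.add(stem[:-1] + SOFTEN[stem[-1]])
    return {base + suffix for base in bases for suffix in SUFFIXES}


def deletions(word: str) -> set[str]:
    """Every string obtained by dropping one letter, keeping the first letter."""
    return {word[:i] + word[i + 1:] for i in range(1, len(word))}


def build_variants(stem: str) -> set[str]:
    """Suffixed forms plus their one-letter-dropped versions."""
    forms = inflections(normalize(stem))
    dropped = set()
    for form in forms:
        if len(form) > 3:  # very short forms would match too many ordinary words
            dropped |= deletions(form)
    return forms | dropped


def is_flagged(token: str, variants: set[str]) -> bool:
    """Check a single token against a precomputed variant set."""
    return normalize(token) in variants


if __name__ == "__main__":
    variants = build_variants("cocuk")
    print(is_flagged("Çocuğu", variants))  # True
    print(is_flagged("kitap", variants))   # False
```

Precomputing the variants into a set keeps each lookup cheap, at the price of a larger table per entry. Multi-word expressions would need the same treatment applied per word, which is where the number of combinations starts to blow up.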
-
Using a pre-trained model would be a sensible approach, since we lack expertise in this particular subject. Turning one of the pre-trained models available on "https://huggingface.co/" into a service would be a great idea.
-
I am currently exploring the idea of adding a "Turkish Swear Words Database" to our project. It would serve as a comprehensive resource for applications such as text-based AI moderation, language learning tools, or linguistic research, focusing specifically on the Turkish language. The goal is to amass an extensive list of Turkish swear words, their potential variations, and, if possible, their cultural contexts, to provide a holistic understanding of how they are used in everyday Turkish.
That said, there are several key aspects to consider. The collection and verification of the data could be a sensitive process, requiring careful attention to accuracy and cultural respect. It would be important to ensure that the database is not promoting or endorsing the use of offensive language but providing an educational resource. Furthermore, we would need to discuss how to manage the moderation and maintenance of this database to ensure it remains current and relevant. I invite any thoughts, suggestions, or potential challenges that the team may foresee with this proposed feature.
Goals: