Courageous Comets #3
base: main
Conversation
Co-authored-by: thijsfranck <[email protected]>
Co-authored-by: isaa-ctaylor <[email protected]>
BREAKING CHANGE: dependencies are no longer updated
This project was huge. It was a lot of fun to see what it was doing, go through the NLTK functions and see them used and applied here, and see how the data was managed and the tricks you used to keep things fast.
Overall I liked it a lot. My comments are really critical and are written as if you had way more time to work on this project, so I want you to know they're coming from someone who really enjoyed stepping through the code, and they're written from the perspective of what I'd like to see in the next steps.
There was a lot I didn't comment on just because I'm not familiar enough with the format/linting/CI/CD/etc. From all I can tell it looks solid and stable, but the NLP side was more my speed, so my comments focus more on that.
Things I really liked
The professionalism of the project is impressive. The commit history with leading words indicating what each commit was supposed to achieve, feature vs fix, etc., was really nice. The documentation was also exceptional and helped with a lot of figuring things out.
I liked the 'avoid hammering the API' features of the bot. Using the db and cache is an important step past simple call/response loops, and it matters for scaling as restarts, changes, and multiple users begin adding load.
Also, I love NLP projects, and I liked the stemming and search features with embeddings. That stuff is just fun, and Discord is a great place to implement and learn more about them while having fun with data that makes a lot more sense than academic data can at first blush.
If you had more time
The project was huge. There was so much boilerplate, and I think it added to the difficulty of organizing the project overall.
A lot of the boilerplate could be condensed into a family of functions that just check error conditions, kept in a single location. Then, on a command from a user, the bot could run through its checks, and a function for something like semantics could just be the semantics, with a single function call for the error checks at the very top rather than the column of checks repeated in so many functions.
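To make that concrete, here's a minimal sketch of the shape I mean, assuming a discord.py-style cog where commands receive an interaction. The decorator name, the check flags, and `CheckFailed` are all my own invention, not the project's API:

```python
import functools

class CheckFailed(Exception):
    """Raised when a precondition for a command is not met."""

def requires(*, redis: bool = False, guild: bool = False):
    """Run the usual safety checks once, before the command body."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(self, interaction, *args, **kwargs):
            # Hypothetical checks mirroring the ones repeated across the cogs.
            if redis and self.bot.redis is None:
                raise CheckFailed("redis connection is not available")
            if guild and interaction.guild is None:
                raise CheckFailed("command must be used inside a guild")
            return await func(self, interaction, *args, **kwargs)
        return wrapper
    return decorator

# Usage: the command body is then just the feature logic.
# @requires(redis=True, guild=True)
# async def semantics(self, interaction, query: str): ...
```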
Similarly, there were points where you were building thresholds to grab within a radius of an embedding inside the 'grab from redis' function. That prep should probably live further from the redis layer, so that the only thing the redis file handles is the database interaction itself. Mapping the text to an embedding and deciding how similar you want your matches to be belong to the same process, and the db call should be very minimal.
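A sketch of the split, just to illustrate; `search_similar`, `fetch_by_vector`, and `radius` are hypothetical names, while `preprocessing` and the vectorizer mirror what I saw in the project:

```python
async def search_similar(bot, query: str, radius: float = 0.5):
    """All the NLP decisions live here: preprocess, embed, pick a threshold."""
    processed = preprocessing.process(query)             # project helper (assumed)
    embedding = await bot.vectorizer.aencode(processed)  # project vectorizer (assumed)
    return await fetch_by_vector(bot.redis, embedding, radius)

async def fetch_by_vector(redis, embedding, radius: float):
    """The redis layer only executes the query -- no NLP decisions here."""
    ...  # build the vector query and run it against the index, nothing more
```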
Additionally, there are a lot of NLP functions spread everywhere. You have stuff about word frequency in a cog, within that you have another directory for keywords and sentiment, then in nltk you have functionality, and then you've got a Hugging Face transformer directory. The UI view has a lot of functions for each of the NLP features. The main level has a sentiment file as well.
It got to be a wild chase across a lot of tabs to figure out what went where. I think I'd like to see a large refactor: rather than having it broken out into cogs, nltk, redis, ui, and the main level, I feel like you should have a class for each feature (sentiment/frequency/etc.), where the class has a UI attribute, a query-redis attribute, a process-text attribute, and so on (a sketch follows below).
I feel like the project grew very wide very fast, and that, in combination with the boilerplate, made it really hard to see how best to group things, and refactoring became technical debt very rapidly.
A lot of the function groups do make a ton of sense--redis is great, the discord interactions make sense. But the NLP was spread across so many places.
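Something like this, very loosely; every name here is hypothetical:

```python
from dataclasses import dataclass
from typing import Any, Awaitable, Callable

@dataclass
class Feature:
    """One NLP feature: its text processing, its db query, and its view."""
    name: str
    process: Callable[[str], Any]         # e.g. preprocess + vectorize the input
    query: Callable[..., Awaitable[Any]]  # the redis lookup for this feature
    view: Callable[[Any], Any]            # builds the discord UI response

# sentiment = Feature("sentiment", process=..., query=..., view=...)
# Everything about a feature then lives behind one import.
```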
Future development
OK, if you were to keep building this, here are added bells and whistles I'd love, in the theme of "information overload". This is absolutely out of scope, but since you've built a great platform and a fun bot, I thought you might enjoy hearing about possible ways to extend it.
Guild Similarity Scores: If a user brings the bot into two guilds, how similar are they? Latent Semantic Analysis and Latent Dirichlet Allocation are great ways to process the semantic similarity of servers.
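As a sketch of the LDA version -- the message lists would come from your db; everything else is standard scikit-learn:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def guild_similarity(messages_a: list[str], messages_b: list[str]) -> float:
    """Compare two guilds by the topic mixtures of their messages."""
    vectorizer = CountVectorizer(stop_words="english", max_features=5000)
    counts = vectorizer.fit_transform(messages_a + messages_b)
    lda = LatentDirichletAllocation(n_components=10, random_state=0)
    topics = lda.fit_transform(counts)
    # Average each guild's messages into a single topic profile, then compare.
    profile_a = topics[: len(messages_a)].mean(axis=0, keepdims=True)
    profile_b = topics[len(messages_a):].mean(axis=0, keepdims=True)
    return float(cosine_similarity(profile_a, profile_b)[0, 0])
```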
MORE PLOTS: I love visualization, and yours are great, but there's so much more you can do. For each channel in a guild, you can parse the messages, do the aforementioned LSA or LDA, and then plot the processed messages into some point space and reduce it with something like SVD or UMAP (a sketch follows the links below).
- https://scikit-learn.org/stable/modules/decomposition.html#truncated-singular-value-decomposition-and-latent-semantic-analysis
- https://umap-learn.readthedocs.io/en/latest/exploratory_analysis.html#open-syllabus-galaxy
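A rough sketch of the SVD flavor; `channel_messages` is a hypothetical `{channel: [message, ...]}` mapping, and the scikit-learn/matplotlib calls are the real APIs:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def plot_channels(channel_messages: dict[str, list[str]]) -> None:
    texts, labels = [], []
    for channel, messages in channel_messages.items():
        texts.extend(messages)
        labels.extend([channel] * len(messages))
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    points = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
    for channel in channel_messages:
        mask = [label == channel for label in labels]
        plt.scatter(points[mask, 0], points[mask, 1], label=channel, s=8)
    plt.legend()
    plt.title("Messages per channel, LSA-reduced to 2D")
    plt.show()
```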
No Deep Learning:
There's no reason not to use transformers or deep learning, but there are really powerful ways to make embeddings that don't rely on them. TF-IDF combined with dimension reduction works on smaller training sets than a lot of deep learning models, so if you find some specific servers are not performing well, you can build your own embedding and use it without needing anything from PyTorch.
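For example, something along these lines -- a sketch, assuming the corpus is a list of a server's message strings and is large enough for 128 SVD components:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def build_embedder(corpus: list[str], dims: int = 128):
    """Fit a lightweight TF-IDF + SVD embedding on the server's own messages."""
    pipeline = make_pipeline(
        TfidfVectorizer(stop_words="english", min_df=2),
        TruncatedSVD(n_components=dims, random_state=0),
    )
    pipeline.fit(corpus)
    # pipeline.transform(["some text"]) -> a (1, dims) vector, no PyTorch needed
    return pipeline
```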
In all, this was an awesome project! I really enjoyed stepping through it and seeing what features you added and how you processed calls. Well done!
If a critical error occurs, attempt to shut down gracefully.
"""
# Override logging configuration by dependencies
settings.setup_logging()
I like the amount of logging--it's very helpful for tracing what broke when a version of something isn't working. But when you have bots interacting with other users, it might be helpful to think about giving those logs to various levels of users.
The logs you've got working right now are great dev logs, but moderators who have your bot in their community will also want to be able to query the bot to understand why it performed a behavior, even if the bot isn't running on their hardware, and users will need to be able to understand what broke when they hit a broken command/response.
A fair amount of the logging calls carry pieces of info that shouldn't be held to only the devs. This project team has a great grasp on what kind of information is helpful to log, but to level it up I'd like to see the team split the logged information into blocks for different groups of users. That'll go the needed extra step--having information is good, but until you get it to the right people it doesn't do nearly as much as it could.
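One possible shape for this, using only the stdlib logging module; the `audience` tag and the handler wiring are my assumptions about how you might slice it:

```python
import logging

class AudienceFilter(logging.Filter):
    """Only pass records tagged for a given audience."""

    def __init__(self, audience: str):
        super().__init__()
        self.audience = audience

    def filter(self, record: logging.LogRecord) -> bool:
        # Untagged records default to dev-only.
        return getattr(record, "audience", "dev") == self.audience

logger = logging.getLogger("courageous_comets")

moderator_log = logging.FileHandler("moderator.log")
moderator_log.addFilter(AudienceFilter("moderator"))
logger.addHandler(moderator_log)

# Tag a record for moderators; everything untagged stays a dev log.
logger.warning("Rate limit hit; cooling down", extra={"audience": "moderator"})
```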
interaction.id,
)

if self.bot.redis is None:
A lot of these boilerplate safety checks should be moved into their own function.
With them kept in every call, each function is very bloated and takes a lot of time for the reader to parse, and it's a lot of repetitive work. Collecting all of the typical 'check if redis is up, check if we're in a guild' tasks in one location makes each function a lot easier to review and to make changes to.
The cost of the abstraction is added ambiguity about what threw the error, so to match that there will probably also need to be bot-state flags added to the call to process each of these checks. While it's a small amount of overhead, it makes tracing errors faster and clearer, and that is generally helpful even without this move.
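One way to keep tracing clear after the move: one exception type per check, so the error itself says which precondition failed. Names here are hypothetical:

```python
class RedisUnavailable(Exception):
    """The redis connection was down when the command ran."""

class NotInGuild(Exception):
    """The command was invoked outside of a guild."""

def ensure_ready(bot, interaction) -> None:
    """Run every standard precondition in one place."""
    if bot.redis is None:
        raise RedisUnavailable("redis connection is not available")
    if interaction.guild is None:
        raise NotInGuild("this command only works inside a guild")

# Each command then opens with a single line:
#     ensure_ready(self.bot, interaction)
```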
query_processed = preprocessing.process(query)
embedding = await self.bot.vectorizer.aencode(query_processed)

messages = await get_messages_by_semantics_similarity(
I really like how this order of operations avoids putting many API calls on Discord. Processing from the local db first and paring down the message space is a really good practice.
""" | ||
search_scope = build_search_scope(guild_id, ids, scope) | ||
|
||
query = VectorQuery( |
I was about to comment on how queries will get bloated and that the embeddings should be managed in their own database, so the space can be subdivided instead of iterated over, but this seems to do exactly that. I've learned something pretty nifty today.
redis=self.bot.redis,
vectorizer=self.bot.vectorizer,
)
I'd like to either see comments here about each step, or a bit in the docstrings about the steps taken, so that if a change alters the behavior of a function, you have some comment explaining what the function 'should' be doing. Unit tests are great for catching when a change broke something, but you're still often left trying to decide what the function was originally supposed to be doing.
As it stands, as an outside reviewer, I'm scanning while trying to ignore the safety checks since those are repetitive, then seeing you're preprocessing a message to pass it to an embedding, etc. A heads-up like:
This function uses the context menu,
takes a message,
preprocesses then vectorizes it,
compares the embedding of the input to the embedded in the db
and returns up to 6 messages with similar semantics
Or comments ahead of each processing step would make it easier to follow. I had to click through the project to get the bot vectorizer function on display to actually check what was happening in this step, and that's a lot of work to just follow along.
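Roughly what I'm picturing -- a docstring that names each step so a reader doesn't have to chase the vectorizer across files. The function name and signature are illustrative, not the project's:

```python
async def search_similar_messages(self, interaction, message) -> None:
    """
    Find messages semantically similar to the given message.

    Steps:
    1. Run the standard safety checks (redis is up, we're in a guild).
    2. Preprocess the message text.
    3. Vectorize the processed text into an embedding.
    4. Compare the embedding against the embeddings stored in redis.
    5. Return up to 6 messages with similar semantics.
    """
```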
# Boilerplate cog code.
# To build a cog:
# 1. Copy this boilerplate to another file in the cogs directory
Is this cog needed in the codebase at this point?
""" | ||
The `test__example` module contains example tests for a Python project. | ||
|
||
For guidelines on writing good tests, see the contributor guide. |
Similarly, this example is probably either unneeded or it should live somewhere else. A top-level directory with this sitting in it feels like a forgotten artifact.
from courageous_comets.redis.schema import MESSAGE_SCHEMA
from courageous_comets.transformers import init_transformers
from courageous_comets.vectorizer import Vectorizer
I like that this configuration is split out; it's easy to read and helps set up the rest of the tests.
pos=mocker.ANY,
compound=mocker.ANY,
)
result = calculate_sentiment("I love this product!")
That's a very nice little test string. :)
As the feature set grows, it might be convenient to have all of the strings you test, and the expected outcome for each kind of test, held in their own file rather than in each test.
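Sketching the idea with pytest parametrization; the case list would live in its own module, and the `.compound` attribute is an assumption based on the fields mocked in this test:

```python
import pytest

from courageous_comets.sentiment import calculate_sentiment

# In practice this list would live in e.g. tests/fixtures/strings.py
SENTIMENT_CASES = [
    ("I love this product!", True),   # expected positive
    ("This is the worst.", False),    # expected negative
]

@pytest.mark.parametrize(("text", "positive"), SENTIMENT_CASES)
def test_sentiment_sign(text: str, positive: bool) -> None:
    result = calculate_sentiment(text)
    # Assumes a VADER-style compound score, as the mocked pos/compound
    # fields in this file suggest.
    assert (result.compound > 0) == positive
```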
from courageous_comets.sentiment import calculate_sentiment
from courageous_comets.vectorizer import Vectorizer
from courageous_comets.words import tokenize_sentence, word_frequency
The tests are good overall and are way more than usually seen, but if a specific class of functions has a common failure mode, it'd be nice to have blurbs that explain it in the test.
Having something fail vs. having it tell you what needs to change is a big difference. With NLP processes, "success" can be ambiguous, so a comment about what a failure means and what could make it a success could help a lot when it comes to integrating upgrades.
Here in the redis tests, there can be a lot of weird db failure modes, and a bit explaining what a failure could be caused by might help during upgrade attempts versions down the line.
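For instance, a blurb in the assertion message so a red test explains itself; the helpers, the fixture, and the failure causes named here are purely illustrative:

```python
async def test_message_roundtrip(redis):
    stored = await save_message(redis, SAMPLE_MESSAGE)  # hypothetical helpers/fixture
    fetched = await get_message(redis, stored.id)
    assert fetched is not None, (
        "Message vanished after save. Common causes: the index schema "
        "changed without re-creating the index, or the key prefix was renamed."
    )
```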