-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modularize #72
base: modularize
Are you sure you want to change the base?
Modularize #72
Conversation
Ping me once you have something you want to be reviewed :) |
Hi, thanks for taking the time! So a sanity check of the general implementation would be great: PropertiesThere is not anymore any distinction between #[derive(Debug, Clone, PartialEq)]
pub struct Token<'t> {
text: &'t str,
span: Span,
is_sentence_start: bool,
is_sentence_end: bool,
has_space_before: bool,
tags: Option<Tags<'t>>,
chunks: Option<Vec<String>>,
} i.e. there are some attributes which are always set ( impl<'t> Token<'t> {
pub fn tags(&self) -> Result<&Tags<'t>, crate::Error>;
pub fn tags_mut(&mut self) -> Result<&mut Tags<'t>, crate::Error>;
pub fn chunks(&self) -> Result<&[String], crate::Error>;
pub fn chunks_mut(&mut self) -> Result<&mut Vec<String>, crate::Error>;
}
|
Forgot to ping you: @drahnr. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A first round of superficial review
let tagger = Tagger::build(serde_json::from_value(paths_value.clone())?, None)?; | ||
let mut build_info = BuildInfo::new(&tagger, &paths.regex_cache)?; | ||
|
||
macro_rules! build { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: A generic fn would work too if supplied with a Buildable
trait bound?
} | ||
// println!("Tokens: {:#?}", tokens.collect::<Vec<_>>()); | ||
// println!("Suggestions: {:#?}", rules.suggest(&opts.text, &tokenizer)); | ||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning on re-using this? If not, it tends to be cleaner in the long run to delete + commit and revert as needed than to comment + commit.
|
||
/// Split a text at the points where the given function is true. | ||
/// Keeps the separators. See https://stackoverflow.com/a/40296745. | ||
fn split<F>(text: &str, split_func: F) -> Vec<&str> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the assumption of a single character always true? There are the unicode bold variants of ?!.
/// - Behavior for trailing whitespace is not defined. Can be included in the last sentence or not be part of any sentence. | ||
pub struct SentenceIter<'t> { | ||
text: &'t str, | ||
splits: Vec<Range<usize>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend to use two type aliases: CharRange
and ByteRange
to disambiguate throughout the code.
Having those test methods feels odd. There must be a better way to enforce the trait bound.
Is there a limitation on consuming and returning an
Can these suggestions be overlapping? I'd assume so, and that should be stated explicitly.
That's a good practice.
I'd argue that you should limit yourself to the simple
I am not sure I can follow. The properties exist merely to assure propagation of which things are going to be read. But imho this should be doable by specifying a type handful of traits and trait bounds, so this is checked at compile time. So if you implement a type, it may return a concrete type
I am not sure this pays off. It might be just easier to use a default passthrough chunker (especially if you consider the compile time impl outlined above)
I fully agree with this, but that changeset seems a bit orthogonal to the rest. It might make sense to split that off. This is an initial review, I'll try to find some time to dig a bit deeper. Very much like these writeups! You are doing a pretty darn good job there. A small nit: From the end users point of view, the API (and hence the pipeline creation) is a key element here and should be smooth and generate sane errors. If error types mismatch, that's usually a headscratch type of experience. Trait bounds are preferable since they are explicit of the requirements of what is passed in and can easily be obtained via docs.rs i.e. |
Notify @bminixhofer |
Thanks for the feedback! I'm a bit busy at the moment so I'll need longer to respond sometimes. I'll answer your comments to the write up for now, I'll look at the code later (also, the code isn't fully ready for review yet).
I don't understand what you mean here. But I also didn't explain what the
The trait bound is
Suggestions can not be overlapping. Should still be stated explicitly though :)
That's a good point, I really like that! My rationale for having these three kinds of pipelines was that nlprule should also be usable for NLU (similar to Spacy) to do chunking, pos-tagging, etc. But having only one
I actually tried checking it at compile time first: The problem here is that individual Saying that any
At least release-wise I'll do this at once, since adding all the extra binaries to the current setup would probably be more work than just changing it. But yes, PR-wise it makes sense to split it off.
I'll keep that in mind. The current distinction between I really appreciate the time you're putting into this. The main takeaway for now is that I'll remove the |
@bminixhofer what's the status on this? |
Hi! I am currently not actively working on nlprule - I hope to circle back at some point but I can't promise any timeline. |
This is the main modularization PR. Fixes #50.
I've been quite busy lately but I've gotten around to doing what has become to some degree a rewrite now :)
Now there are components called
Chunker
,Tagger
,Tokenizer
,MultiwordTagger
,Disambiguator
andRules
which can be composed using aPipeline
like:In addition, binary hosting will move to crates.io with a tighter integration in the crate to avoid having to deal with an increasing amount of binaries:
There's still some work left to do:
And I'm still open to changes in the top-level API. However, the hard part (separating the components) is definitely done.
This will also first be made available as pre-release.
@drahnr IIRC you offered to review this in the past, but there is probably a huge diff so I'm not sure if a full review would make sense (also, this is not yet ready for review). I'll write up the most important points of change in the coming days, I would greatly appreciate your feedback on that. There's also some smaller issues I'm still undecided about among those.