
Modularize #72

Open · wants to merge 15 commits (base: modularize)

Conversation

bminixhofer (Owner) commented May 29, 2021

This is the main modularization PR. Fixes #50.

I've been quite busy lately, but I've gotten around to doing what has, to some degree, become a rewrite now :)

Now there are components called Chunker, Tagger, Tokenizer, MultiwordTagger, Disambiguator and Rules, which can be composed using a Pipeline like this:

let correcter = Pipeline::new((
    Tokenizer::new("tokenizer.bin")?,
    MultiwordTagger::new("multiword_tagger.bin")?,
    Chunker::new("chunker.bin")?,
    Disambiguator::new("disambiguator.bin")?,
    Rules::new("rules.bin")?
))?;

correcter.suggest("...")

In addition, binary hosting will move to crates.io with tighter integration into the crate, to avoid having to deal with an increasing number of binaries:

// behind a `binaries` feature flag
// maybe also feature flags for each language
use nlprule::lang::en;

let correcter = en::correcter();

There's still some work left to do:

  • Update docs
  • Update tests
  • Add tests for new functionality created by modularization
  • Update CI
  • Update Python bindings

And I'm still open to changes in the top-level API. However, the hard part (separating the components) is definitely done.

This will also first be made available as a pre-release.

@drahnr IIRC you offered to review this in the past, but there is probably a huge diff so I'm not sure if a full review would make sense (also, this is not yet ready for review). I'll write up the most important points of change in the coming days; I would greatly appreciate your feedback on that. There are also some smaller issues among those that I'm still undecided about.

drahnr (Contributor) commented May 30, 2021

Ping me once you have something you want to be reviewed :)

bminixhofer (Owner, Author) commented Jun 7, 2021

Hi, thanks for taking the time! So a sanity check of the general implementation would be great:

Properties

There is no longer any distinction between Token / IncompleteToken and Sentence / IncompleteSentence. Instead, there is just a Sentence consisting of Tokens. Tokens now look like this:

#[derive(Debug, Clone, PartialEq)]
pub struct Token<'t> {
    text: &'t str,
    span: Span,
    is_sentence_start: bool,
    is_sentence_end: bool,
    has_space_before: bool,
    tags: Option<Tags<'t>>,
    chunks: Option<Vec<String>>,
}

i.e. there are some attributes which are always set (is_sentence_start etc.) and some which may not be set (tags and chunks). The latter are referred to as properties and can be accessed like this:

impl<'t> Token<'t> {
    pub fn tags(&self) -> Result<&Tags<'t>, crate::Error>;
    pub fn tags_mut(&mut self) -> Result<&mut Tags<'t>, crate::Error>;
    pub fn chunks(&self) -> Result<&[String], crate::Error>;
    pub fn chunks_mut(&mut self) -> Result<&mut Vec<String>, crate::Error>;
}
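
For example, a consumer of a Sentence might handle a possibly missing property like this (a rough sketch; the tokens() and text() accessor names are assumptions, not the final API):

// Sketch only: `tokens()` and `text()` accessors are assumed names.
for token in sentence.tokens() {
    match token.chunks() {
        Ok(chunks) => println!("{}: {:?}", token.text(), chunks),
        // `chunks` was never set, e.g. because no Chunker ran on this sentence.
        Err(err) => eprintln!("chunks not set: {}", err),
    }
}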

Component, Transform, Tokenize, Suggest traits

There is now a Component trait which describes what an NLPRule binary has to provide: it needs to have a name, be creatable from a reader, and be writable:

pub trait Component: Serialize + DeserializeOwned + Clone {
    fn name() -> &'static str;


    fn new<P: AsRef<Path>>(p: P) -> Result<Self, crate::Error> {
        let reader = BufReader::new(File::open(p.as_ref())?);
        Self::from_reader(reader)
    }


    fn from_reader<R: Read>(reader: R) -> Result<Self, crate::Error> {
        Ok(bincode::deserialize_from(reader)?)
    }


    fn to_writer<W: Write>(&self, writer: W) -> Result<(), crate::Error> {
        Ok(bincode::serialize_into(writer, self)?)
    }
}
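
Since the (de)serialization logic lives in the default methods, writing a component out and loading it back is just a matter of calling them, e.g. (a sketch; file names are placeholders and error conversions are glossed over):

use std::{fs::File, io::BufWriter};

// Sketch: load a component, write it back out, and load the copy again.
let rules = Rules::new("rules.bin")?;
rules.to_writer(BufWriter::new(File::create("rules_copy.bin")?))?;
let rules_copy = Rules::from_reader(File::open("rules_copy.bin")?)?;
println!("component name: {}", Rules::name());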

There are five components currently: chunker, multiword_tagger, rules, tagger, tokenizer.

There are three new traits that describe the different kinds of component something can be: Tokenize, Transform and Suggest:

pub trait Tokenize {
    // omitted

    fn tokenize<'t>(&'t self, text: &'t str) -> Box<dyn Iterator<Item = Sentence<'t>> + 't>;

    fn tokenize_sentence<'t>(&'t self, sentence: &'t str) -> Option<Sentence<'t>>;

    fn test(&self) -> Result<(), crate::Error> {
        Ok(())
    }
}

pub trait Transform {
    // omitted

    fn transform<'t>(&'t self, sentence: Sentence<'t>) -> Result<Sentence<'t>, properties::Error>;

    fn test<TOK: Tokenize>(&self, tokenizer: TOK) -> Result<(), crate::Error> {
        Ok(())
    }
}

pub trait Suggest {
    // omitted

    fn suggest(&self, sentence: &Sentence) -> Result<Vec<Suggestion>, properties::Error>;

    fn correct(&self, sentence: &Sentence) -> Result<String, properties::Error> {
        let suggestions = self.suggest(sentence)?;
        Ok(apply_suggestions(&sentence, &suggestions))
    }

    fn test<TOK: Tokenize>(&self, tokenizer: TOK) -> Result<(), crate::Error> {
        Ok(())
    }
}

So:

  • A Tokenizer turns arbitrary text into zero or more sentences (unfortunately, it became really unwieldy / impossible in the pipelines to use a statically dispatched iterator, do you think a Box<dyn Iterator<..>> is ok here?)
  • A Transformer turns a sentence of tokens into a sentence (of the same length!) with potentially different tokens, e.g. setting properties like chunks and tags.
  • A Suggester turns a sentence into a vector of suggestions for that sentence.

Note that all of these traits are also implemented for &T if implemented for T.
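
To make the division of labour concrete, this is roughly what running the components by hand (without a Pipeline) looks like under the signatures above (a sketch; error conversions between crate::Error and properties::Error are glossed over):

// Sketch: wiring Tokenize -> Transform -> Suggest manually.
let tokenizer = Tokenizer::new("tokenizer.bin")?;
let chunker = Chunker::new("chunker.bin")?;
let rules = Rules::new("rules.bin")?;

for sentence in tokenizer.tokenize("A text with possibly some errors.") {
    // `Transform` consumes the sentence and returns it with more properties set.
    let sentence = chunker.transform(sentence)?;
    // `Suggest` only reads the (now enriched) sentence.
    let suggestions = rules.suggest(&sentence)?;
    println!("{:?}", suggestions);
}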

Pipelines

The above can be composed using a Pipeline; there are three different kinds of Pipeline:

  • Pipeline: A pipeline consisting of Tokenize -> Transform -> ... -> Transform -> Suggest. This pipeline does not implement any of the traits but has the methods
pub fn suggest<'t>(&'t self, text: &'t str) -> impl Iterator<Item = Vec<Suggestion>> + 't; // iterator over suggestions for each sentence
pub fn correct<'t>(&'t self, text: &'t str) -> impl Iterator<Item = String> + 't; // iterator over corrected sentences
  • tokenize::Pipeline: A pipeline consisting of Tokenize -> Transform -> ... -> Transform. Implements Tokenize (see the sketch after this list).
  • transform::Pipeline: A pipeline consisting of Transform -> ... -> Transform. Implements Transform.
  • There is no pipeline for Suggest since that could easily enough be created from the others.
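
For example, an analysis-only pipeline that tags and chunks but never produces suggestions could look like this (a sketch; whether tokenize::Pipeline uses exactly this constructor is an assumption):

// Sketch: a tokenize::Pipeline built from a Tokenize followed by Transforms.
let analyzer = tokenize::Pipeline::new((
    Tokenizer::new("tokenizer.bin")?,
    Chunker::new("chunker.bin")?,
    Disambiguator::new("disambiguator.bin")?,
))?;

// It implements `Tokenize`, so it can be used anywhere a tokenizer is expected.
for sentence in analyzer.tokenize("Some text to analyze.") {
    // inspect tags / chunks of the tokens here
}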

I tried implementing this as one Pipeline with different implementations depending on what it is constructed from, but couldn't get around the conflicting implementations, and it is arguably clearer to have different structs anyway.

Each pipeline also has a test function which calls the test functions of its constituents in order. Pipeline components can of course also be accessed from the pipeline. Pipelines are implemented using the same trick as the standard library uses for the Hash impl for tuples, with lengths up to 8:

impl_pipeline! { A, B, }
impl_pipeline! { A, C, B  }
impl_pipeline! { A, D, B, C }
impl_pipeline! { A, E, B, C, D }
// ...
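
The shape of that trick, reduced to the test method alone, is roughly the following (a toy illustration of the tuple pattern, not the actual impl_pipeline! macro):

// Toy version of the tuple trick: implement a trait for tuples of any
// (bounded) arity by invoking one macro per arity.
trait Test {
    fn test(&self) -> Result<(), crate::Error>;
}

macro_rules! impl_test_for_tuple {
    ($($component:ident),+) => {
        impl<$($component: Test),+> Test for ($($component,)+) {
            fn test(&self) -> Result<(), crate::Error> {
                #[allow(non_snake_case)]
                let ($($component,)+) = self;
                // run the constituents' tests in order
                $($component.test()?;)+
                Ok(())
            }
        }
    };
}

impl_test_for_tuple! { A }
impl_test_for_tuple! { A, B }
impl_test_for_tuple! { A, B, C }
// ... and so on, up to the maximum supported pipeline length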

Property Errors

One problem with having this kind of composability is that it is not trivial to say whether the complete Pipeline will work for any input or not. The easy way out would be to just return a Result<..> for every call to correct, tokenize etc. but I wanted a nicer solution that guarantees (where this is possible) that once constructed, a Pipeline works for any input.

I solved this using so-called property guards. A Tokenizer and Transformer must declare which properties they will read/write:

pub trait Tokenize {
    // these are the functions that are omitted above
    fn properties(&self) -> PropertiesMut {
        PropertiesMut::default()
    }

    fn property_guard(&self, sentence: &mut Sentence) -> Result<PropertyGuardMut, Error> {
        self.properties().build(sentence)
    }
}

// same for `Transform`

For example:

impl Transform for Chunker {
    fn properties(&self) -> PropertiesMut {
        lazy_static! {
            static ref PROPERTIES: PropertiesMut = Properties::default()
                .read(&[Property::Tags])
                .write(&[Property::Chunks]);
        }
        *PROPERTIES
    }

   // ...
}

For Suggest, there is the same but naturally without the option to write anything (and thus called Properties):

pub trait Suggest {
    fn properties(&self) -> Properties {
        Properties::default()
    }


    fn property_guard(&self, sentence: &Sentence) -> Result<PropertyGuard, Error> {
        self.properties().build(sentence)
    }

   // ..
}

Behind the scenes, PropertiesMut and Properties are just bit sets.
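
Assuming Property is a small enum, the representation could be as simple as this (an illustration of the "just a bit set" point, not necessarily the actual layout):

// Sketch of the representation: one bit per `Property` variant.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct Properties {
    read_mask: u16,
}

#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct PropertiesMut {
    read_mask: u16,
    write_mask: u16,
}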

Now property_guard can be called on a sentence to:

  1. Initialize the properties that are being written if they are not yet initialized.
  2. Obtain a PropertyGuard which can access the properties like e.g. *props.chunks_mut(token)? = (*chunk).clone();

Accessing properties using the property guard returns a Result<.., properties::Error>, not a Result<.., crate::Error>. The Transform and Suggest traits also return a properties::Error on failure, so token.tags() and related methods cannot be used from within components; properties have to be accessed through a property guard.

Having this setup, it is possible to chain the Properties to check whether some Pipeline or tokenize::Pipeline is valid without depending on any concrete input.
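
With bit sets, that check reduces to walking the stages and making sure every read is covered by an earlier write; schematically (a sketch building on the assumed representation above, with a hypothetical error variant):

// Sketch: validate a chain of stages given their declared properties.
fn validate_chain(stages: &[PropertiesMut]) -> Result<(), properties::Error> {
    let mut written: u16 = 0;
    for stage in stages {
        if stage.read_mask & !written != 0 {
            // a property is read before any earlier stage wrote / initialized it
            return Err(properties::Error::Unset); // hypothetical error variant
        }
        written |= stage.write_mask;
    }
    Ok(())
}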

This does not solve all issues: for example, the English LT rules expect a multiword tagger to be used, but the multiword tagger only modifies the tags, so an English pipeline with Rules but without a MultiwordTagger would lead to missing or wrong suggestions in some cases without a hard fail (whereas omitting the chunker would lead to a hard fail). This could maybe be improved by not allowing properties to be modified, only set once (then multiword_tags would have to be a new property), but I didn't want to change this for the initial modularization; it will just have to be well documented for now.

Declaring properties beforehand also has the nice side effect that one will, for example, be able to create a Rules set consisting of only rules that do not need the chunker, and then omit the chunker entirely in the pipeline.

Binary hosting

I haven't fully implemented this yet, but binaries will move to being part of the crate on crates.io and there will be template pipelines for each language:

use nlprule::lang::en;

// behind the scenes: `include_bytes!` of the components, and creation of a pipeline
let correcter = en::correcter();
 // "analyzer" terminology is new; corresponds to the `Tokenizer` from before, since now the `Tokenizer` component is the actual tokenizer in the common meaning of the word
let analyzer = en::analyzer();

There will be feature flags for all binaries (binaries-all) or for only the binaries of a specific language (binaries-en, binaries-de, etc.). There is also pub fn binary_path(lang_code: &str, name: &str) -> PathBuf in nlprule::lang, which could be used in a build.rs to access the binaries and do some compression or other modification; nlprule-build will be removed completely.
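
As an illustration of the build.rs idea (purely hypothetical usage; it assumes nlprule is available as a build-dependency with the corresponding binaries feature enabled):

// Hypothetical build.rs sketch: grab a bundled binary and copy (or recompress) it into OUT_DIR.
use std::{env, fs, path::Path};

fn main() {
    let src = nlprule::lang::binary_path("en", "tokenizer");
    let out_dir = env::var("OUT_DIR").unwrap();
    // a real script might run e.g. gzip/zstd compression here instead of a plain copy
    fs::copy(&src, Path::new(&out_dir).join("en_tokenizer.bin")).unwrap();
}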


I hope I did a reasonably good job of summarizing this, of course if anything is unclear please ask. I'm still open to changes to anything of the above, but I myself do not see any big issues right now.

bminixhofer (Owner, Author) commented

Forgot to ping you: @drahnr.

drahnr (Contributor) left a comment

A first round of superficial review

let tagger = Tagger::build(serde_json::from_value(paths_value.clone())?, None)?;
let mut build_info = BuildInfo::new(&tagger, &paths.regex_cache)?;

macro_rules! build {

nit: A generic fn would work too if supplied with a Buildable trait bound?

}
// println!("Tokens: {:#?}", tokens.collect::<Vec<_>>());
// println!("Suggestions: {:#?}", rules.suggest(&opts.text, &tokenizer));
// }

Are you planning on re-using this? If not, it tends to be cleaner in the long run to delete + commit and revert as needed than to comment + commit.


/// Split a text at the points where the given function is true.
/// Keeps the separators. See https://stackoverflow.com/a/40296745.
fn split<F>(text: &str, split_func: F) -> Vec<&str>

Is the assumption of a single character always true? There are the unicode bold variants of ?!.

/// - Behavior for trailing whitespace is not defined. Can be included in the last sentence or not be part of any sentence.
pub struct SentenceIter<'t> {
text: &'t str,
splits: Vec<Range<usize>>,

I'd recommend to use two type aliases: CharRange and ByteRange to disambiguate throughout the code.
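
Something along these lines (sketch):

// Same representation, but the alias documents which unit the indices are in.
pub type ByteRange = std::ops::Range<usize>; // indices into the UTF-8 bytes
pub type CharRange = std::ops::Range<usize>; // indices counted in chars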

drahnr (Contributor) commented Jun 11, 2021

> There is now a Component trait which describes what an NLPRule binary has to provide […] There are three new traits that describe the different kinds of component something can be: Tokenize, Transform and Suggest:
>
>     fn test<TOK: Tokenize>(&self, tokenizer: TOK) -> Result<(), crate::Error> {
>         Ok(())
>     }

Having those test methods feels odd. There must be a better way to enforce the trait bound.

> A `Tokenize`r turns arbitrary text into zero or more sentences (unfortunately, it became really unwieldy / impossible in the pipelines to use a statically dispatched iterator, do you think a `Box<dyn Iterator<..>>` is ok here?)

Is there a limitation on consuming and returning an impl Iterator<Item=T> + Send + 'static?

> A `Suggest`er turns a sentence into a vector of suggestions for that sentence.

Can these suggestions be overlapping? I'd assume so, and that should be stated explicitly.

> Note that all of these traits are also implemented for &T if implemented for T.

That's a good practice.

> `Pipeline`: A pipeline consisting of `Tokenize -> Transform -> ... -> Transform -> Suggest`. This pipeline does not implement any of the traits but has the methods […]

I'd argue that you should limit yourself to the simple Tokenize -> Transform -> Suggest path, and impl a TransformChain consisting of N Transform steps.

> I solved this using so-called property guards. A Tokenizer and Transformer must declare which properties they will read/write: […]

I am not sure I can follow. The properties exist merely to ensure propagation of which things are going to be read. But imho this should be doable by specifying a handful of traits and trait bounds, so this is checked at compile time.

So if you implement a type, it may return a concrete type trait Producer<X> { fn produce() -> Result<X> { .. } } with Producer being implemented for your particular type; X can have all the trait impls that e.g. a particular trait impl of Transformer might need - all at compile time. This is still a bit fuzzy, from the few minutes I spent here, so I might very well miss something.

> Declaring properties beforehand also has the nice side effect that one will, for example, be able to create a Rules set consisting of only rules that do not need the chunker, and then omit the chunker entirely in the pipeline.

I am not sure this pays off. It might be just easier to use a default passthrough chunker (especially if you consider the compile time impl outlined above)

> I haven't fully implemented this yet, but binaries will move to being part of the crate on crates.io and there will be template pipelines for each language: […]

I fully agree with this, but that changeset seems a bit orthogonal to the rest. It might make sense to split that off.

This is an initial review, I'll try to find some time to dig a bit deeper.


I very much like these writeups! You are doing a pretty darn good job there. A small nit: from the end user's point of view, the API (and hence the pipeline creation) is a key element here and should be smooth and generate sane errors. If error types mismatch, that's usually a headscratcher type of experience. Trait bounds are preferable since they are explicit about the requirements of what is passed in and can easily be looked up via docs.rs.

drahnr (Contributor) commented Jun 13, 2021

Notify @bminixhofer

bminixhofer (Owner, Author) commented Jun 14, 2021

Thanks for the feedback! I'm a bit busy at the moment so I'll need longer to respond sometimes.

I'll answer your comments on the write-up for now and look at the code later (also, the code isn't fully ready for review yet).

> Having those test methods feels odd. There must be a better way to enforce the trait bound.

I don't understand what you mean here. But I also didn't explain what the test methods do: They run the tests for the component (for example, each rule has unit tests). Pipelines also have a test method, which calls all the component .test() methods in order. So pipeline.test() replaces the previous test.rs and test_disambiguation binaries. And if .test() passes, the component / pipeline definitely works correctly.

> Is there a limitation on consuming and returning an impl Iterator<Item=T> + Send + 'static?

The trait bound is

fn tokenize<'t>(&'t self, text: &'t str) -> Box<dyn Iterator<Item = Sentence<'t>> + 't>;

'static isn't needed here; I'm not so sure about Send and Sync. These iterators are only returned by the tokenize functions and never explicitly consumed, since Transform and Suggest operate on the sentence level. They are iterated over internally in the pipelines.

> Can these suggestions be overlapping? I'd assume so, and that should be stated explicitly.

Suggestions can not be overlapping. Should still be stated explicitly though :)

> I'd argue that you should limit yourself to the simple Tokenize -> Transform -> Suggest path, and impl a TransformChain consisting of N Transform steps.

That's a good point, I really like that! My rationale for having these three kinds of pipelines was that nlprule should also be usable for NLU (similar to spaCy) to do chunking, POS tagging, etc. But having only one Pipeline which is always Tokenize -> Transform -> Suggest and a TransformChain, plus a method analyze (or process, pipe) on the pipeline which does only the Tokenize -> Transform steps, should be sufficient. There'd then also be an IdentitySuggester which never suggests anything, so that there's no overhead for someone using nlprule without needing suggestions. That's less overhead in the code, and removes potential confusion from having three things named Pipeline.
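
For illustration, such an IdentitySuggester would essentially be this (a sketch; only the required suggest method is shown, the defaults cover the rest):

// Sketch: a suggester that never suggests anything, so an analysis-only
// pipeline pays no extra cost.
#[derive(Debug, Clone, Copy, Default)]
pub struct IdentitySuggester;

impl Suggest for IdentitySuggester {
    fn suggest(&self, _sentence: &Sentence) -> Result<Vec<Suggestion>, properties::Error> {
        Ok(Vec::new())
    }
}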

> I am not sure I can follow. The properties exist merely to ensure propagation of which things are going to be read. But imho this should be doable by specifying a handful of traits and trait bounds, so this is checked at compile time.
>
> So if you implement a type, it may return a concrete type trait Producer { fn produce() -> Result { .. } } with Producer being implemented for your particular type; X can have all the trait impls that e.g. a particular trait impl of Transformer might need - all at compile time. This is still a bit fuzzy, from the few minutes I spent here, so I might very well miss something.

I actually tried checking it at compile time first: The problem here is that individual Rules need different properties (either tags and chunks, only tags, only chunks, none), the Rules struct would then have to contain a Vec<Box<dyn RuleTrait>> (or something like that), and require the union of all of the properties that the individual rules need. This might still be possible to do at compile time, but I didn't find a good solution to that, and it would probably be a good deal more complex than the way it is currently.

Saying that any Rules container needs all the properties each rule can possibly need is one way out, but this prevents, for example, having a Rules which only runs the rules that do not need a chunker, if running the chunker is too slow for your application. To be fair, it is arguable whether that use case is worth supporting though.
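
Concretely, the current runtime approach boils down to something like this (a sketch; the rules field, the RuleTrait bound and a union method on Properties are assumptions made for illustration):

// Sketch: a `Rules` container reports the union of the properties its rules need,
// computed at runtime rather than enforced at compile time.
impl Suggest for Rules {
    fn properties(&self) -> Properties {
        self.rules // assumed: Vec<Box<dyn RuleTrait>>
            .iter()
            .fold(Properties::default(), |acc, rule| acc.union(rule.properties()))
    }

    // `suggest` etc. omitted
}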

> I fully agree with this, but that changeset seems a bit orthogonal to the rest. It might make sense to split that off.

At least release-wise I'll do this at once, since adding all the extra binaries to the current setup would probably be more work than just changing it. But yes, PR-wise it makes sense to split it off.

> From the end user's point of view, the API (and hence the pipeline creation) is a key element here and should be smooth and generate sane errors. If error types mismatch, that's usually a headscratcher type of experience. Trait bounds are preferable since they are explicit about the requirements of what is passed in and can easily be looked up via docs.rs.

I'll keep that in mind. The current distinction between crate::Error and crate::property::Error is only internal to the components, so it is not relevant except when creating a new component or using a component directly instead of via the pipeline.


I really appreciate the time you're putting into this. The main takeaway for now is that I'll remove the Tokenize -> Transform pipeline and rename the transform::Pipeline to TransformChain. I'll also hopefully get around to doing some of the other cleanup that's still needed soon.

bminixhofer mentioned this pull request Jun 26, 2021
thecodrr commented Mar 8, 2023

@bminixhofer what's the status on this?

bminixhofer (Owner, Author) commented

Hi! I am currently not actively working on nlprule - I hope to circle back at some point but I can't promise any timeline.
