Modularize #72

bminixhofer · 2021-05-29T10:09:23Z

This is the main modularization PR. Fixes #50.

I've been quite busy lately but I've gotten around to doing what has become to some degree a rewrite now :)

Now there are components called Chunker, Tagger, Tokenizer, MultiwordTagger, Disambiguator and Rules which can be composed using a Pipeline like:

let correcter = Pipeline::new((
    Tokenizer::new("tokenizer.bin")?,
    MultiwordTagger::new("multiword_tagger.bin")?,
    Chunker::new("chunker.bin")?,
    Disambiguator::new("disambiguator.bin")?,
    Rules::new("rules.bin")?
))?;

correcter.suggest("...")

In addition, binary hosting will move to crates.io with a tighter integration in the crate to avoid having to deal with an increasing number of binaries:

// behind a `binaries` feature flag
// maybe also feature flags for each language
use nlprule::lang::en;

let correcter = en::correcter();

There's still some work left to do:

And I'm still open to changes in the top-level API. However, the hard part (separating the components) is definitely done.

This will also first be made available as pre-release.

@drahnr IIRC you offered to review this in the past, but there is probably a huge diff so I'm not sure if a full review would make sense (also, this is not yet ready for review). I'll write up the most important points of change in the coming days, I would greatly appreciate your feedback on that. There's also some smaller issues I'm still undecided about among those.

drahnr · 2021-05-30T16:44:51Z

Ping me once you have something you want to be reviewed :)

bminixhofer · 2021-06-07T20:29:45Z

Hi, thanks for taking the time! So a sanity check of the general implementation would be great:

Properties

There is no longer any distinction between Token / IncompleteToken and Sentence / IncompleteSentence. Instead, there is just a Sentence consisting of Tokens. Tokens now look like this:

#[derive(Debug, Clone, PartialEq)]
pub struct Token<'t> {
    text: &'t str,
    span: Span,
    is_sentence_start: bool,
    is_sentence_end: bool,
    has_space_before: bool,
    tags: Option<Tags<'t>>,
    chunks: Option<Vec<String>>,
}

i.e. there are some attributes which are always set (is_sentence_start etc.) and some which may not be set (tags and chunks). These are referred to as properties and can be accessed like:

impl<'t> Token<'t> {
    pub fn tags(&self) -> Result<&Tags<'t>, crate::Error>;
    pub fn tags_mut(&mut self) -> Result<&mut Tags<'t>, crate::Error>;
    pub fn chunks(&self) -> Result<&[String], crate::Error>;
    pub fn chunks_mut(&mut self) -> Result<&mut Vec<String>, crate::Error>;
}
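A minimal sketch of how such an optional property accessor could behave, using a cut-down `Token` with only `text` and `chunks`. `PropertyError` and `init_chunks` are assumptions made for illustration; they stand in for `crate::Error` and for whatever initialization the crate actually performs.

```rust
use std::fmt;

/// Hypothetical error for an unset property (stand-in for `crate::Error`).
#[derive(Debug, PartialEq)]
pub enum PropertyError {
    Unset(&'static str),
}

impl fmt::Display for PropertyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PropertyError::Unset(name) => write!(f, "property `{}` is not set", name),
        }
    }
}

impl std::error::Error for PropertyError {}

/// Cut-down token: `text` is always set, `chunks` is an optional property.
#[derive(Debug, Clone, PartialEq)]
pub struct Token<'t> {
    text: &'t str,
    chunks: Option<Vec<String>>,
}

impl<'t> Token<'t> {
    pub fn new(text: &'t str) -> Self {
        Token { text, chunks: None }
    }

    pub fn text(&self) -> &'t str {
        self.text
    }

    /// Read access fails with an error while the property is unset.
    pub fn chunks(&self) -> Result<&[String], PropertyError> {
        self.chunks.as_deref().ok_or(PropertyError::Unset("chunks"))
    }

    /// Write access to an already initialized property.
    pub fn chunks_mut(&mut self) -> Result<&mut Vec<String>, PropertyError> {
        self.chunks.as_mut().ok_or(PropertyError::Unset("chunks"))
    }

    /// Hypothetical initializer: sets the property to an empty value if unset.
    pub fn init_chunks(&mut self) {
        self.chunks.get_or_insert_with(Vec::new);
    }
}
```

Reading `chunks` before any component has initialized it yields an error instead of a panic, which is the behavior the accessor signatures above suggest.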

`Component`, `Transform`, `Tokenize`, `Suggest` traits

There is now a Component trait which describes the pragmatics for an NLPRule binary: needs to have a name, be creatable from a reader, and writable:

pub trait Component: Serialize + DeserializeOwned + Clone {
    fn name() -> &'static str;


    fn new<P: AsRef<Path>>(p: P) -> Result<Self, crate::Error> {
        let reader = BufReader::new(File::open(p.as_ref())?);
        Self::from_reader(reader)
    }


    fn from_reader<R: Read>(reader: R) -> Result<Self, crate::Error> {
        Ok(bincode::deserialize_from(reader)?)
    }


    fn to_writer<W: Write>(&self, writer: W) -> Result<(), crate::Error> {
        Ok(bincode::serialize_into(writer, self)?)
    }
}

There are five components currently: chunker, multiword_tagger, rules, tagger, tokenizer.

There are three new traits that describe the different kinds of component something can be: Tokenize, Transform and Suggest:

pub trait Tokenize {
    // omitted

    fn tokenize<'t>(&'t self, text: &'t str) -> Box<dyn Iterator<Item = Sentence<'t>> + 't>;

    fn tokenize_sentence<'t>(&'t self, sentence: &'t str) -> Option<Sentence<'t>>;

    fn test(&self) -> Result<(), crate::Error> {
        Ok(())
    }
}

pub trait Transform {
    // omitted

    fn transform<'t>(&'t self, sentence: Sentence<'t>) -> Result<Sentence<'t>, properties::Error>;

    fn test<TOK: Tokenize>(&self, tokenizer: TOK) -> Result<(), crate::Error> {
        Ok(())
    }
}

pub trait Suggest {
    // omitted

    fn suggest(&self, sentence: &Sentence) -> Result<Vec<Suggestion>, properties::Error>;

    fn correct(&self, sentence: &Sentence) -> Result<String, properties::Error> {
        let suggestions = self.suggest(sentence)?;
        Ok(apply_suggestions(&sentence, &suggestions))
    }

    fn test<TOK: Tokenize>(&self, tokenizer: TOK) -> Result<(), crate::Error> {
        Ok(())
    }
}

So:

* A Tokenizer turns arbitrary text into zero or more sentences (unfortunately, it became really unwieldy / impossible in the pipelines to use a statically dispatched iterator, do you think a Box<dyn Iterator<..>> is ok here?)
* A Transformer turns a sentence of tokens into a sentence (of the same length!) with potentially different tokens, e.g. setting properties like chunks and tags.
* A Suggester turns a sentence into a vector of suggestions for that sentence.

Note that all of these traits are also implemented for &T if implemented for T.
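The `&T` forwarding can be sketched with a blanket impl. The `Suggest` trait here is simplified to plain strings, and `Shout` and `run` are hypothetical names used only for the demonstration.

```rust
/// Simplified stand-in for the `Suggest` trait, operating on plain strings.
pub trait Suggest {
    fn suggest(&self, sentence: &str) -> Vec<String>;
}

/// Blanket impl: any reference to a `Suggest` type is itself `Suggest`.
impl<'a, T: Suggest + ?Sized> Suggest for &'a T {
    fn suggest(&self, sentence: &str) -> Vec<String> {
        // `self` is `&&T` here, so dereference twice to reach `T::suggest`.
        (**self).suggest(sentence)
    }
}

/// Hypothetical component used only for the demonstration.
pub struct Shout;

impl Suggest for Shout {
    fn suggest(&self, sentence: &str) -> Vec<String> {
        vec![sentence.to_uppercase()]
    }
}

/// Takes the component by value, so it accepts both `Shout` and `&Shout`.
pub fn run<S: Suggest>(suggester: S, sentence: &str) -> Vec<String> {
    suggester.suggest(sentence)
}
```

With this in place, a pipeline can either own its components or borrow them, and generic code like `run` does not need to care which.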

Pipelines

The above can be composed using a Pipeline, there are three different kinds of Pipelines:

Pipeline: A pipeline consisting of Tokenize -> Transform -> ... -> Transform -> Suggest. This pipeline does not implement any of the traits but has the methods

pub fn suggest<'t>(&'t self, text: &'t str) -> impl Iterator<Item = Vec<Suggestion>> + 't; // iterator over sentence suggestions
pub fn correct<'t>(&'t self, text: &'t str) -> impl Iterator<Item = String> + 't; // iterator over sentences

tokenize::Pipeline: A pipeline consisting of Tokenize -> Transform -> ... -> Transform. Implements Tokenize.
transform::Pipeline: A pipeline consisting of Transform -> ... -> Transform. Implements Transform.
There is no pipeline for Suggest since that could easily enough be created from the others.

I tried implementing this as one Pipeline with different implementations depending on what it is constructed from but didn't get around conflicting implementations, and it is arguably clearer to have different structs anyway.

Each pipeline also has a test function which calls all the test functions of constituents correctly. Pipeline components can of course also be accessed from the pipeline. Pipelines are implemented using the same trick as in the standard lib with the Hash impl for tuples with lengths up to 8:

impl_pipeline! { A, B, }
impl_pipeline! { A, C, B  }
impl_pipeline! { A, D, B, C }
impl_pipeline! { A, E, B, C, D }
// ...
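A rough sketch of the tuple trick, assuming a simplified `Transform` trait over `String`. The macro name `impl_chain!` and the `Trim` and `Lowercase` components are made up for this example, and the real `impl_pipeline!` invocation shape may well differ.

```rust
/// Simplified stand-in for `Transform`, operating on owned strings.
pub trait Transform {
    fn transform(&self, input: String) -> String;
}

/// Implements `Transform` for a tuple whose elements all implement `Transform`,
/// applying the elements from left to right.
macro_rules! impl_chain {
    ($($name:ident),+) => {
        impl<$($name: Transform),+> Transform for ($($name,)+) {
            #[allow(non_snake_case)]
            fn transform(&self, input: String) -> String {
                // Destructure the tuple; each binding reuses its type parameter name.
                let ($($name,)+) = self;
                let mut value = input;
                $( value = $name.transform(value); )+
                value
            }
        }
    };
}

// One invocation per supported tuple length.
impl_chain!(A);
impl_chain!(A, B);
impl_chain!(A, B, C);

pub struct Trim;
impl Transform for Trim {
    fn transform(&self, input: String) -> String {
        input.trim().to_string()
    }
}

pub struct Lowercase;
impl Transform for Lowercase {
    fn transform(&self, input: String) -> String {
        input.to_lowercase()
    }
}
```

Because a tuple of transforms is itself a transform, tuples can be nested, which is one way to get past whatever maximum length the macro invocations cover.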

Property Errors

One problem with having this kind of composability is that it is not trivial to say whether the complete Pipeline will work for any input or not. The easy way out would be to just return a Result<..> for every call to correct, tokenize etc. but I wanted a nicer solution that guarantees (where this is possible) that once constructed, a Pipeline works for any input.

I solved this using so-called property guards. A Tokenizer and Transformer must declare which properties it will read/write:

pub trait Tokenize {
    // these are the functions that are omitted above
    fn properties(&self) -> PropertiesMut {
        PropertiesMut::default()
    }

    fn property_guard(&self, sentence: &mut Sentence) -> Result<PropertyGuardMut, Error> {
        self.properties().build(sentence)
    }
}

// same for `Transform`

for example

impl Transform for Chunker {
    fn properties(&self) -> PropertiesMut {
        lazy_static! {
            static ref PROPERTIES: PropertiesMut = Properties::default()
                .read(&[Property::Tags])
                .write(&[Property::Chunks]);
        }
        *PROPERTIES
    }

   // ...
}

For Suggest, there is the same but naturally without the option to write anything (and thus called Properties):

pub trait Suggest {
    fn properties(&self) -> Properties {
        Properties::default()
    }


    fn property_guard(&self, sentence: &Sentence) -> Result<PropertyGuard, Error> {
        self.properties().build(sentence)
    }

   // ..
}

Behind the scenes PropertiesMut and Properties are just a bit set.

Now property_guard can be called on a sentence to:

Initialize the properties that are being written if they are not yet initialized.
Obtain a PropertyGuard which can access the properties like e.g. *props.chunks_mut(token)? = (*chunk).clone();

Accessing properties using the property guard returns a Result<.., properties::Error> not Result<.., crate::Error>. The Transform and Suggest traits also return a properties::Error on failure, so the token.tags() and related methods can not be used from components, properties have to be accessed through a property guard.

Having this setup, it is possible to chain the Properties to check whether some Pipeline or tokenize::Pipeline is valid without depending on any concrete input.
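A sketch of how such an input-independent check could work with bit sets. It assumes that a property becomes available once an earlier stage declares it as written; `check_chain` and the two-variant `Property` enum are illustrative and not taken from the PR.

```rust
/// The two optional properties, one bit each.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Property {
    Tags = 0b01,
    Chunks = 0b10,
}

/// Declared reads and writes of one pipeline stage, stored as bit sets.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct PropertiesMut {
    read: u8,
    write: u8,
}

impl PropertiesMut {
    pub fn read(mut self, props: &[Property]) -> Self {
        for p in props {
            self.read |= *p as u8;
        }
        self
    }

    pub fn write(mut self, props: &[Property]) -> Self {
        for p in props {
            self.write |= *p as u8;
        }
        self
    }
}

/// Walks the stages in order and tracks which properties have been written.
/// Returns the first property that a stage reads before any earlier stage wrote it.
pub fn check_chain(stages: &[PropertiesMut]) -> Result<(), Property> {
    let mut available = 0u8;
    for stage in stages {
        let missing = stage.read & !available;
        for p in [Property::Tags, Property::Chunks].iter().copied() {
            if missing & (p as u8) != 0 {
                return Err(p);
            }
        }
        available |= stage.write;
    }
    Ok(())
}
```

The check only needs the declarations, so it can run once at pipeline construction rather than on every call.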

This does not solve all issues, since for example the English LT rules expect a multiword tagger to be used, but the multiword tagger only modifies the tags, so an English pipeline with Rules but without a MultiwordTagger would lead to missing / wrong suggestions in some cases without a hard fail. But omitting the chunker would lead to a hard fail. This could maybe be improved by not allowing to modify any properties, only setting them once (then multiword_tags would have to be a new property), but I didn't want to change this for the initial modularization, it will just have to be well documented for now.

Declaring properties beforehand also has the nice side effect that one will be able to, for example, create a Rules set consisting of only rules that do not need the chunker, and then omit the chunker entirely in the pipeline.

Binary hosting

I haven't fully implemented this yet, but binaries will move to being part of the crate on crates.io and there will be template pipelines for each language:

use nlprule::lang::en;

// behind the scenes: `include_bytes!` of the components, and creation of a pipeline
let correcter = en::correcter();
 // "analyzer" terminology is new; corresponds to the `Tokenizer` from before, since now the `Tokenizer` component is the actual tokenizer in the common meaning of the word
let analyzer = en::analyzer();

and feature flags for all binaries, or for only the binaries of some specific language named binaries-all, binaries-en, binaries-de etc. There is also pub fn binary_path(lang_code: &str, name: &str) -> PathBuf in nlprule::lang which could be used in a build.rs to access the binaries and do some compression or other modification. nlprule-build will be removed completely.

I hope I did a reasonably good job of summarizing this, of course if anything is unclear please ask. I'm still open to changes to anything of the above, but I myself do not see any big issues right now.

bminixhofer · 2021-06-10T06:48:34Z

Forgot to ping you: @drahnr.

drahnr

A first round of superficial review

drahnr · 2021-06-11T07:09:29Z

nlprule/src/bin/compile.rs

+    let tagger = Tagger::build(serde_json::from_value(paths_value.clone())?, None)?;
+    let mut build_info = BuildInfo::new(&tagger, &paths.regex_cache)?;
+
+    macro_rules! build {


nit: A generic fn would work too if supplied with a Buildable trait bound?
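One way the suggested generic function could look. The body of the `build!` macro is not shown in this thread, so `Buildable`, its `Source` type, `to_bytes`, `build_into` and `WordList` are all hypothetical names chosen for the sketch.

```rust
use std::io::Write;

/// Hypothetical trait for components that can be built from some source
/// and serialized to bytes.
pub trait Buildable: Sized {
    /// Input needed to build the component.
    type Source;
    fn name() -> &'static str;
    fn build(source: Self::Source) -> Result<Self, String>;
    fn to_bytes(&self) -> Vec<u8>;
}

/// Generic replacement for a per-component macro: build, write, return.
pub fn build_into<T: Buildable, W: Write>(source: T::Source, mut out: W) -> Result<T, String> {
    let component = T::build(source)?;
    out.write_all(&component.to_bytes())
        .map_err(|e| format!("{}: {}", T::name(), e))?;
    Ok(component)
}

/// Toy component used only for the demonstration.
pub struct WordList(pub Vec<String>);

impl Buildable for WordList {
    type Source = &'static str;

    fn name() -> &'static str {
        "word_list"
    }

    fn build(source: Self::Source) -> Result<Self, String> {
        if source.is_empty() {
            return Err("empty source".to_string());
        }
        Ok(WordList(source.split_whitespace().map(String::from).collect()))
    }

    fn to_bytes(&self) -> Vec<u8> {
        self.0.join("\n").into_bytes()
    }
}
```

A macro may still be the pragmatic choice if the components differ in ways a single trait cannot express, for example in the number of build arguments.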

drahnr · 2021-06-11T07:11:15Z

nlprule/src/bin/run.rs

-}
+//     println!("Tokens: {:#?}", tokens.collect::<Vec<_>>());
+//     println!("Suggestions: {:#?}", rules.suggest(&opts.text, &tokenizer));
+// }


Are you planning on re-using this? If not, it tends to be cleaner in the long run to delete + commit and revert as needed than to comment + commit.

drahnr · 2021-06-11T07:20:09Z

nlprule/src/components/tokenizer/mod.rs

+
+/// Split a text at the points where the given function is true.
+/// Keeps the separators. See https://stackoverflow.com/a/40296745.
+fn split<F>(text: &str, split_func: F) -> Vec<&str>


Is the assumption of a single character always true? There are the unicode bold variants of ?!.
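A sketch of a separator-keeping split that does not assume one-byte separators: it advances by `char::len_utf8` rather than by 1. The name `split_keep` is hypothetical, and emitting each separator as its own piece is an assumption about the intended behavior.

```rust
/// Splits `text` at every character for which `is_separator` returns true.
/// Each separator is kept as its own piece, whatever its width in bytes.
pub fn split_keep<F>(text: &str, is_separator: F) -> Vec<&str>
where
    F: Fn(char) -> bool,
{
    let mut pieces = Vec::new();
    let mut start = 0;
    for (index, ch) in text.char_indices() {
        if is_separator(ch) {
            if start < index {
                pieces.push(&text[start..index]);
            }
            // Use the encoded width of the separator, not a fixed width of 1.
            let end = index + ch.len_utf8();
            pieces.push(&text[index..end]);
            start = end;
        }
    }
    if start < text.len() {
        pieces.push(&text[start..]);
    }
    pieces
}
```

Slicing with a fixed width of 1 would panic on a multi-byte separator such as the full-width question mark U+FF1F, because the slice would end inside a code point.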

drahnr · 2021-06-11T07:21:05Z

nlprule/src/components/tokenizer/mod.rs

+/// - Behavior for trailing whitespace is not defined. Can be included in the last sentence or not be part of any sentence.
+pub struct SentenceIter<'t> {
+    text: &'t str,
+    splits: Vec<Range<usize>>,


I'd recommend to use two type aliases: CharRange and ByteRange to disambiguate throughout the code.
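A small sketch of the two aliases together with a conversion between them. `char_to_byte_range` is a hypothetical helper. Note that type aliases only document intent; the compiler still treats both as `Range<usize>`, so newtypes would be needed if mixing them up should be a compile error.

```rust
use std::ops::Range;

/// Range measured in bytes, valid for slicing a `&str`.
pub type ByteRange = Range<usize>;
/// Range measured in Unicode scalar values (`char`s).
pub type CharRange = Range<usize>;

/// Converts a range counted in `char`s into the matching byte range of `text`.
/// Returns `None` if the range is inverted or reaches past the end of `text`.
pub fn char_to_byte_range(text: &str, chars: &CharRange) -> Option<ByteRange> {
    // Byte offset of every char boundary, including the end of the text.
    let boundaries: Vec<usize> = text
        .char_indices()
        .map(|(index, _)| index)
        .chain(std::iter::once(text.len()))
        .collect();
    let start = *boundaries.get(chars.start)?;
    let end = *boundaries.get(chars.end)?;
    if start > end {
        return None;
    }
    Some(start..end)
}
```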

drahnr · 2021-06-11T09:26:38Z

Hi, thanks for taking the time! So a sanity check of the general implementation would be great:

Properties

There is no longer any distinction between Token / IncompleteToken and Sentence / IncompleteSentence. Instead, there is just a Sentence consisting of Tokens. Tokens now look like this:

#[derive(Debug, Clone, PartialEq)]
pub struct Token<'t> {
    text: &'t str,
    span: Span,
    is_sentence_start: bool,
    is_sentence_end: bool,
    has_space_before: bool,
    tags: Option<Tags<'t>>,
    chunks: Option<Vec<String>>,
}

i.e. there are some attributes which are always set (is_sentence_start etc.) and some which may not be set (tags and chunks). These are referred to as properties and can be accessed like:

impl<'t> Token<'t> {
    pub fn tags(&self) -> Result<&Tags<'t>, crate::Error>;
    pub fn tags_mut(&mut self) -> Result<&mut Tags<'t>, crate::Error>;
    pub fn chunks(&self) -> Result<&[String], crate::Error>;
    pub fn chunks_mut(&mut self) -> Result<&mut Vec<String>, crate::Error>;
}

`Component`, `Transform`, `Tokenize`, `Suggest` traits

There is now a Component trait which describes the pragmatics for an NLPRule binary: needs to have a name, be creatable from a reader, and writable:

pub trait Component: Serialize + DeserializeOwned + Clone {
    fn name() -> &'static str;


    fn new<P: AsRef<Path>>(p: P) -> Result<Self, crate::Error> {
        let reader = BufReader::new(File::open(p.as_ref())?);
        Self::from_reader(reader)
    }


    fn from_reader<R: Read>(reader: R) -> Result<Self, crate::Error> {
        Ok(bincode::deserialize_from(reader)?)
    }


    fn to_writer<W: Write>(&self, writer: W) -> Result<(), crate::Error> {
        Ok(bincode::serialize_into(writer, self)?)
    }
}

There are five components currently: chunker, multiword_tagger, rules, tagger, tokenizer.

There are three new traits that describe the different kinds of component something can be: Tokenize, Transform and Suggest:

pub trait Tokenize {
    // omitted

    fn tokenize<'t>(&'t self, text: &'t str) -> Box<dyn Iterator<Item = Sentence<'t>> + 't>;

    fn tokenize_sentence<'t>(&'t self, sentence: &'t str) -> Option<Sentence<'t>>;

    fn test(&self) -> Result<(), crate::Error> {
        Ok(())
    }
}

pub trait Transform {
    // omitted

    fn transform<'t>(&'t self, sentence: Sentence<'t>) -> Result<Sentence<'t>, properties::Error>;

    fn test<TOK: Tokenize>(&self, tokenizer: TOK) -> Result<(), crate::Error> {
        Ok(())
    }
}

pub trait Suggest {
    // omitted

    fn suggest(&self, sentence: &Sentence) -> Result<Vec<Suggestion>, properties::Error>;

    fn correct(&self, sentence: &Sentence) -> Result<String, properties::Error> {
        let suggestions = self.suggest(sentence)?;
        Ok(apply_suggestions(&sentence, &suggestions))
    }

    fn test<TOK: Tokenize>(&self, tokenizer: TOK) -> Result<(), crate::Error> {
        Ok(())
    }
}

Having those test methods feels odd. There must be a better way to enforce the trait bound.

So:

* A `Tokenize`r turns arbitrary text into zero or more sentences (unfortunately, it became really unwieldy / impossible in the pipelines to use a statically dispatched iterator, do you think a `Box<dyn Iterator<..>>` is ok here?)

Is there a limitation on consuming and returning an impl Iterator<Item=T> + Send + 'static?

* A `Transform`er turns a sentence of tokens into a sentence (of the same length!) with potentially different tokens, e.g. setting properties like chunks and tags.

* A `Suggest`er turns a sentence into a vector of suggestions for that sentence.

Can these suggestions be overlapping? I'd assume so, and that should be stated explicitly.

Note that all of these traits are also implemented for &T if implemented for T.

That's a good practice.

Pipelines

The above can be composed using a Pipeline, there are three different kinds of Pipelines:
* `Pipeline`: A pipeline consisting of `Tokenize -> Transform -> ... -> Transform -> Suggest`. This pipeline does not implement any of the traits but has the methods

I'd argue that you should limit yourself to the simple Tokenize -> Transform -> Suggest path, and impl a TransformChain consisting of N Transform steps.

pub fn suggest<'t>(&'t self, text: &'t str) -> impl Iterator<Item = Vec<Suggestion>> + 't; // iterator over sentence suggestions
pub fn correct<'t>(&'t self, text: &'t str) -> impl Iterator<Item = String> + 't; // iterator over sentences
* `tokenize::Pipeline`: A pipeline consisting of `Tokenize -> Transform -> ... -> Transform`. Implements `Tokenize`.

* `transform::Pipeline`: A pipeline consisting of `Transform -> ... -> Transform`. Implements `Transform`.

* There is no pipeline for `Suggest` since that could easily enough be created from the others.
I tried implementing this as one Pipeline with different implementations depending on what it is constructed from but didn't get around conflicting implementations, and it is arguably clearer to have different structs anyway.

Each pipeline also has a test function which calls all the test functions of constituents correctly. Pipeline components can of course also be accessed from the pipeline. Pipelines are implemented using the same trick as in the standard lib with the Hash impl for tuples with lengths up to 8:
impl_pipeline! { A, B, }
impl_pipeline! { A, C, B  }
impl_pipeline! { A, D, B, C }
impl_pipeline! { A, E, B, C, D }
// ...

Property Errors

One problem with having this kind of composability is that it is not trivial to say whether the complete Pipeline will work for any input or not. The easy way out would be to just return a Result<..> for every call to correct, tokenize etc. but I wanted a nicer solution that guarantees (where this is possible) that once constructed, a Pipeline works for any input.

I solved this using so-called property guards. A Tokenizer and Transformer must declare which properties it will read/write:

pub trait Tokenize {
    // these are the functions that are omitted above
    fn properties(&self) -> PropertiesMut {
        PropertiesMut::default()
    }

    fn property_guard(&self, sentence: &mut Sentence) -> Result<PropertyGuardMut, Error> {
        self.properties().build(sentence)
    }
}

// same for `Transform`

for example

impl Transform for Chunker {
    fn properties(&self) -> PropertiesMut {
        lazy_static! {
            static ref PROPERTIES: PropertiesMut = Properties::default()
                .read(&[Property::Tags])
                .write(&[Property::Chunks]);
        }
        *PROPERTIES
    }

   // ...
}

I am not sure I can follow. The properties exist merely to assure propagation of which things are going to be read. But imho this should be doable by specifying a handful of traits and trait bounds, so this is checked at compile time.

So if you implement a type, it may return a concrete type trait Producer<X> { fn produce() -> Result<X> { .. } } with Producer being implemented for your particular type, X can have all the trait impls that i.e. a particular trait impl of Transformer might need - all at compile time. This is still a bit fuzzy, from the few minutes I spent here, so I might very well miss smth.

For Suggest, there is the same but naturally without the option to write anything (and thus called Properties):

pub trait Suggest {
    fn properties(&self) -> Properties {
        Properties::default()
    }


    fn property_guard(&self, sentence: &Sentence) -> Result<PropertyGuard, Error> {
        self.properties().build(sentence)
    }

   // ..
}

Behind the scenes PropertiesMut and Properties are just a bit set.

Now property_guard can be called on a sentence to:
1. Initialize the properties that are being written if they are not yet initialized.

2. Obtain a `PropertyGuard` which can access the properties like e.g. `*props.chunks_mut(token)? = (*chunk).clone();`
Accessing properties using the property guard returns a Result<.., properties::Error> not Result<.., crate::Error>. The Transform and Suggest traits also return a properties::Error on failure, so the token.tags() and related methods can not be used from components, properties have to be accessed through a property guard.

Having this setup, it is possible to chain the Properties to check whether some Pipeline or tokenize::Pipeline is valid without depending on any concrete input.

This does not solve all issues, since for example the English LT rules expect a multiword tagger to be used, but the multiword tagger only modifies the tags, so an English pipeline with Rules but without a MultiwordTagger would lead to missing / wrong suggestions in some cases without a hard fail. But omitting the chunker would lead to a hard fail. This could maybe be improved by not allowing to modify any properties, only setting them once (then multiword_tags would have to be a new property), but I didn't want to change this for the initial modularization, it will just have to be well documented for now.

Declaring properties beforehand also has the nice side effect that one will be able to, for example, create a Rules set consisting of only rules that do not need the chunker, and then omit the chunker entirely in the pipeline.

I am not sure this pays off. It might be just easier to use a default passthrough chunker (especially if you consider the compile time impl outlined above)

Binary hosting

I haven't fully implemented this yet, but binaries will move to being part of the crate on crates.io and there will be template pipelines for each language:
use nlprule::lang::en;

// behind the scenes: `include_bytes!` of the components, and creation of a pipeline
let correcter = en::correcter();
 // "analyzer" terminology is new; corresponds to the `Tokenizer` from before, since now the `Tokenizer` component is the actual tokenizer in the common meaning of the word
let analyzer = en::analyzer();
and feature flags for all binaries, or for only the binaries of some specific language named binaries-all, binaries-en, binaries-de etc. There is also pub fn binary_path(lang_code: &str, name: &str) -> PathBuf in nlprule::lang which could be used in a build.rs to access the binaries and do some compression or other modification. nlprule-build will be removed completely.

I fully agree with this, but that changeset seems a bit orthogonal to the rest. It might make sense to split that off.

This is an initial review, I'll try to find some time to dig a bit deeper.

Very much like these writeups! You are doing a pretty darn good job there. A small nit: From the end user's point of view, the API (and hence the pipeline creation) is a key element here and should be smooth and generate sane errors. If error types mismatch, that's usually a headscratch type of experience. Trait bounds are preferable since they are explicit of the requirements of what is passed in and can easily be obtained via docs.rs i.e.

drahnr · 2021-06-13T12:23:21Z

Notify @bminixhofer

bminixhofer · 2021-06-14T12:25:05Z

Thanks for the feedback! I'm a bit busy at the moment so I'll need longer to respond sometimes.

I'll answer your comments to the write up for now, I'll look at the code later (also, the code isn't fully ready for review yet).

Having those test methods feels odd. There must be a better way to enforce the trait bound.

I don't understand what you mean here. But I also didn't explain what the test methods do: They run the tests for the component (for example, each rule has unit tests). Pipelines also have a test method, which calls all the component .test() methods in order. So pipeline.test() replaces the previous test.rs and test_disambiguation binaries. And if .test() passes, the component / pipeline definitely works correctly.

Is there a limitation on consuming and returning an impl Iterator<Item=T> + Send + 'static?

The trait bound is

fn tokenize<'t>(&'t self, text: &'t str) -> Box<dyn Iterator<Item = Sentence<'t>> + 't>;

'static isn't needed here, I'm not so sure about Send and Sync. These iterators are only returned by the tokenize functions, and never explicitly consumed since Transform and Suggest operate on sentence-level. They are iterated over internally in the pipelines.
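A sketch of the boxed-iterator shape under discussion, with sentences simplified to `&str`. `DelimiterTokenizer` and `count_sentences` are made-up names. Since Rust 1.75 a trait method can also return `impl Iterator` directly, but such a trait can then no longer be used as `dyn Tokenize`, which the boxed version still allows.

```rust
/// Simplified stand-in for `Tokenize`, yielding `&str` sentences.
pub trait Tokenize {
    /// The iterator borrows both `self` and `text` for `'t`; no `'static` bound is involved.
    fn tokenize<'t>(&'t self, text: &'t str) -> Box<dyn Iterator<Item = &'t str> + 't>;
}

/// Hypothetical tokenizer that splits on a single delimiter character.
pub struct DelimiterTokenizer {
    pub delimiter: char,
}

impl Tokenize for DelimiterTokenizer {
    fn tokenize<'t>(&'t self, text: &'t str) -> Box<dyn Iterator<Item = &'t str> + 't> {
        Box::new(
            text.split(self.delimiter)
                .map(str::trim)
                .filter(|sentence| !sentence.is_empty()),
        )
    }
}

/// Works through a trait object, which the boxed return type makes possible.
pub fn count_sentences(tokenizer: &dyn Tokenize, text: &str) -> usize {
    tokenizer.tokenize(text).count()
}
```

The cost is one heap allocation and dynamic dispatch per `tokenize` call, not per sentence.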

Can these suggestions be overlapping? I'd assume so, and that should be stated explicitly.

Suggestions cannot overlap. Should still be stated explicitly though :)
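To illustrate why the non-overlap guarantee matters when applying suggestions, here is a sketch with a deliberately simplified `Suggestion` (a byte span and a single replacement). The struct layout and the `overlaps` and `apply` helpers are assumptions for this example and not the crate's actual `apply_suggestions`.

```rust
/// Simplified suggestion: replace the bytes `start..end` with `replacement`.
#[derive(Debug, Clone, PartialEq)]
pub struct Suggestion {
    pub start: usize,
    pub end: usize,
    pub replacement: String,
}

/// Returns true if any two suggestions cover a common byte.
pub fn overlaps(suggestions: &[Suggestion]) -> bool {
    let mut sorted: Vec<&Suggestion> = suggestions.iter().collect();
    sorted.sort_by_key(|s| s.start);
    sorted.windows(2).any(|pair| pair[0].end > pair[1].start)
}

/// Applies all suggestions, or returns `None` if they overlap or if a span
/// does not lie on character boundaries inside `text`.
pub fn apply(text: &str, suggestions: &[Suggestion]) -> Option<String> {
    let invalid = suggestions
        .iter()
        .any(|s| text.get(s.start..s.end).is_none());
    if invalid || overlaps(suggestions) {
        return None;
    }
    // Apply from the back so earlier spans keep their original offsets.
    let mut sorted: Vec<&Suggestion> = suggestions.iter().collect();
    sorted.sort_by_key(|s| std::cmp::Reverse(s.start));
    let mut result = text.to_string();
    for s in sorted {
        result.replace_range(s.start..s.end, &s.replacement);
    }
    Some(result)
}
```

If overlaps were allowed, the result would depend on the order of application, so a corrector would need an explicit conflict-resolution rule.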

I'd argue that you should limit yourself to the simple Tokenize -> Transform -> Suggest path, and impl a TransformChain consisting of N Transform steps.

That's a good point, I really like that! My rationale for having these three kinds of pipelines was that nlprule should also be usable for NLU (similar to Spacy) to do chunking, pos-tagging, etc. But having only one Pipeline which is always Tokenize -> Transform -> Suggest and TransformChain, plus a method analyze (or process, pipe) on the pipeline which does only the Tokenize -> Transform steps should be sufficient. There'd then also be an IdentitySuggester which never suggests anything, so that there's no overhead for someone using nlprule without needing suggestions. That's less overhead in the code, and removes potential confusion from having three things named Pipeline.
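A sketch of that reduced design, with sentences simplified to `Vec<String>`. The `TransformChain` here is a `Vec` of boxed transforms purely for brevity (the PR composes components statically via tuples), and `PeriodTokenizer` and `Lowercase` are hypothetical components.

```rust
pub trait Tokenize {
    fn tokenize(&self, text: &str) -> Vec<Vec<String>>;
}

pub trait Transform {
    fn transform(&self, sentence: Vec<String>) -> Vec<String>;
}

pub trait Suggest {
    fn suggest(&self, sentence: &[String]) -> Vec<String>;
}

/// N transform steps applied in order; the chain is itself a `Transform`.
pub struct TransformChain(pub Vec<Box<dyn Transform>>);

impl Transform for TransformChain {
    fn transform(&self, sentence: Vec<String>) -> Vec<String> {
        self.0.iter().fold(sentence, |acc, step| step.transform(acc))
    }
}

/// Never suggests anything; for users who only want the analysis.
pub struct IdentitySuggester;

impl Suggest for IdentitySuggester {
    fn suggest(&self, _sentence: &[String]) -> Vec<String> {
        Vec::new()
    }
}

/// The single pipeline shape: Tokenize -> TransformChain -> Suggest.
pub struct Pipeline<T, S> {
    tokenizer: T,
    chain: TransformChain,
    suggester: S,
}

impl<T: Tokenize, S: Suggest> Pipeline<T, S> {
    pub fn new(tokenizer: T, chain: TransformChain, suggester: S) -> Self {
        Pipeline { tokenizer, chain, suggester }
    }

    /// Runs only the Tokenize -> Transform steps.
    pub fn analyze(&self, text: &str) -> Vec<Vec<String>> {
        self.tokenizer
            .tokenize(text)
            .into_iter()
            .map(|sentence| self.chain.transform(sentence))
            .collect()
    }

    /// Runs the full pipeline and returns one suggestion list per sentence.
    pub fn suggest(&self, text: &str) -> Vec<Vec<String>> {
        let sentences = self.analyze(text);
        sentences.iter().map(|s| self.suggester.suggest(s)).collect()
    }
}

pub struct PeriodTokenizer;

impl Tokenize for PeriodTokenizer {
    fn tokenize(&self, text: &str) -> Vec<Vec<String>> {
        text.split('.')
            .map(|s| s.split_whitespace().map(String::from).collect::<Vec<_>>())
            .filter(|tokens| !tokens.is_empty())
            .collect()
    }
}

pub struct Lowercase;

impl Transform for Lowercase {
    fn transform(&self, sentence: Vec<String>) -> Vec<String> {
        sentence.into_iter().map(|t| t.to_lowercase()).collect()
    }
}
```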

I am not sure I can follow. The properties exist merely to assure propagation of which things are going to be read. But imho this should be doable by specifying a handful of traits and trait bounds, so this is checked at compile time.

So if you implement a type, it may return a concrete type trait Producer<X> { fn produce() -> Result<X> { .. } } with Producer being implemented for your particular type, X can have all the trait impls that i.e. a particular trait impl of Transformer might need - all at compile time. This is still a bit fuzzy, from the few minutes I spent here, so I might very well miss smth.

I actually tried checking it at compile time first: The problem here is that individual Rules need different properties (either tags and chunks, only tags, only chunks, none), the Rules struct would then have to contain a Vec<Box<dyn RuleTrait>> (or something like that), and require the union of all of the properties that the individual rules need. This might still be possible to do at compile time, but I didn't find a good solution to that, and it would probably be a good deal more complex than the way it is currently.

Saying that any Rules container needs all the properties each rule can possibly need is one way out, but this prevents for example having Rules which only run the rules which do not need a chunker, if using the chunker is not fast enough for your application. To be fair, it is arguable whether that use-case is worth supporting though.
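A sketch of the runtime variant described here: each rule reports the properties it needs, the container takes the union, and a filtered container drops the rules that need a given property. The `Rule` trait, `SimpleRule`, the bit constants and `without` are assumptions for illustration only.

```rust
pub const TAGS: u8 = 0b01;
pub const CHUNKS: u8 = 0b10;

/// Hypothetical rule interface: an id plus the properties the rule reads.
pub trait Rule {
    fn id(&self) -> &str;
    fn required(&self) -> u8;
}

pub struct SimpleRule {
    pub id: &'static str,
    pub required: u8,
}

impl Rule for SimpleRule {
    fn id(&self) -> &str {
        self.id
    }

    fn required(&self) -> u8 {
        self.required
    }
}

/// Heterogeneous rule container.
pub struct Rules(pub Vec<Box<dyn Rule>>);

impl Rules {
    /// Union of the properties needed by any contained rule.
    pub fn required(&self) -> u8 {
        self.0.iter().fold(0, |acc, rule| acc | rule.required())
    }

    /// Keeps only the rules that do not need `property`.
    pub fn without(self, property: u8) -> Rules {
        Rules(
            self.0
                .into_iter()
                .filter(|rule| (rule.required() & property) == 0)
                .collect(),
        )
    }
}
```

After `without(CHUNKS)`, the union no longer contains the chunk bit, so a pipeline check based on declared properties would accept a pipeline that has no chunker.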

I fully agree with this, but that changeset seems a bit orthogonal to the rest. It might make sense to split that off.

At least release-wise I'll do this at once, since adding all the extra binaries to the current setup would probably be more work than just changing it. But yes, PR-wise it makes sense to split it off.

From the end user's point of view, the API (and hence the pipeline creation) is a key element here and should be smooth and generate sane errors. If error types mismatch, that's usually a headscratch type of experience. Trait bounds are preferable since they are explicit of the requirements of what is passed in and can easily be obtained via docs.rs i.e.

I'll keep that in mind. The current distinction between crate::Error and crate::properties::Error is only internal to the components, so it is not relevant, except when creating a new component, or using a component directly, instead of via the pipeline.

I really appreciate the time you're putting into this. The main takeaway for now is that I'll remove the Tokenize -> Transform pipeline and rename the transform::Pipeline to TransformChain. I'll also hopefully get around to doing some of the other cleanup that's still needed soon.

thecodrr · 2023-03-08T08:02:02Z

@bminixhofer what's the status on this?

bminixhofer · 2023-03-08T13:05:58Z

Hi! I am currently not actively working on nlprule - I hope to circle back at some point but I can't promise any timeline.

bminixhofer added 15 commits April 24, 2021 14:44

make tags, chunks optional, preliminary separation of tagger

28748a3

add property guards

0d96867

add Transform, Suggest and Tokenize traits

4ab8e33

implement Tokenize

1dd7587

add Pipeline::new

9ff92e6

restructure into components/

b60750d

fix disambiguator properties, unify tests

147b17c

fix pipeline properties, add lang module

2bc7842

add lang module, add setup.sh

5927d69

update ci, feature flag for each language

967cd95

update ci

6ddffdb

add language specific test binaries

9846e6e

fix build_and_test script

4dde97e

update ci

5bf39bc

remove nlprule-build, update tests

aa94184

This was referenced Jun 6, 2021

Compile error in build.rs from README.md #73

Closed

Support Rules written in Rust #75

Open

drahnr reviewed Jun 11, 2021


bminixhofer mentioned this pull request Jun 26, 2021

Web demo #45

Open
