Bnewm0609/layer slice #44

bnewm0609 · 2023-08-16T23:26:38Z

Begins an implementation of Layer to wrap the List[Entities] and allow for more intuitive slicing (e.g. doc.sentences[:3].text instead of [sent.text for sent in doc.sentences[:3]]). This addresses #24.

The new data structure is implemented in papermage/magelib/layer.py and inherits from Python's UserList. The changes to integrate Layer were mainly made in the Document data structure.
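To make the idea concrete, here is a minimal sketch of a UserList subclass that supports slicing plus attribute access across its items. It is not the code in papermage/magelib/layer.py: the Entity class and the `__getattr__` broadcasting are assumptions made for illustration only.

```python
from collections import UserList


class Entity:
    """Hypothetical stand-in for a papermage entity; it only carries text."""

    def __init__(self, text):
        self.text = text


class Layer(UserList):
    """Illustrative list wrapper, not the actual papermage implementation."""

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails. Refuse private and
        # dunder names (and `data`) so copying and pickling cannot recurse.
        if name.startswith("_") or name == "data":
            raise AttributeError(name)
        # Broadcast the attribute access to every wrapped item.
        return [getattr(item, name) for item in self.data]


sentences = Layer([Entity("I am."), Entity("I was."), Entity("You are.")])

# On current Python 3 versions, slicing a UserList subclass returns an
# instance of the same subclass, so the slice still supports `.text`.
first_two = sentences[:2]
print(first_two.text)
```

Because UserList keeps its items in `self.data`, the wrapper still compares equal to a plain list with the same contents, which is what makes the test changes in this PR small.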

One design decision to consider is what to do with chained access: e.g. doc.pages.paragraphs.sentences.tokens. Currently, each access creates a new layer, so doing the above would create a four-dimensional list. Two consequences of this decision:

To get the first token, you would have to write doc.pages.paragraphs.sentences.tokens[0][0][0][0], which is a bit ugly.
doc.pages.paragraphs.pages does not return the original Layer of pages, which is a bit unintuitive.

The main question is: "Should chained accessing return the union of all of the entities in a single layer or should it return the entities in the shape of the chained accessing?"

As another example, if the doc is

Paragraph 1: "I am. I was."
Paragraph 2: "You are. You were."

Sentence 1: "I am."
Sentence 2: "I was."
Sentence 3: "You are."
Sentence 4: "You were."

Which should doc.paragraphs.sentences.text return?

# Option 1 - currently implemented
[[["I", "am", "."], ["I", "was", "."]], [["You", "are", "."], ["You", "were", "."]]]

# Option 2
["I", "am", ".", "I", "was", ".", "You", "are", ".", "You", "were", "."]
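The difference between the two options can be sketched with plain nested lists standing in for layers. The `flatten` helper is hypothetical and not part of the PR; it only shows what returning the union of entities would look like compared with keeping the shape of the chained access.

```python
# Plain-list stand-in for the nested result of Option 1.
nested = [
    [["I", "am", "."], ["I", "was", "."]],
    [["You", "are", "."], ["You", "were", "."]],
]


def flatten(items):
    """Recursively flatten nested lists into one flat list (hypothetical helper)."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


option_1 = nested           # keeps the shape of the chained access
option_2 = flatten(nested)  # union of all entities in a single layer

# Option 1 needs one index per level of chaining to reach the first token.
first_token_nested = option_1[0][0][0]
# Option 2 needs a single index.
first_token_flat = option_2[0]
```

Option 1 preserves which paragraph and sentence each token came from; Option 2 gives simpler indexing but drops that grouping.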

bnewm0609 · 2023-08-16T23:27:54Z

tests/test_magelib/test_document.py

-        self.assertListEqual(doc.chunks[2].tokens, [tokens[5]])
+        self.assertSequenceEqual(doc.chunks[0].tokens, tokens[0:3])
+        self.assertSequenceEqual(doc.chunks[1].tokens, tokens[3:5])
+        self.assertSequenceEqual(doc.chunks[2].tokens, [tokens[5]])



assertListEqual requires both arguments to be list instances, while assertSequenceEqual does not. Because the first argument is now a Layer, we use assertSequenceEqual.
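The behaviour can be reproduced with a bare UserList, used here as a stand-in for Layer; the test class name is made up for this sketch.

```python
import unittest
from collections import UserList


class SequenceAssertionDemo(unittest.TestCase):
    """Shows why the tests switch assertion methods for list-like wrappers."""

    def test_list_equal_rejects_userlist(self):
        wrapped = UserList(["a", "b"])
        # assertListEqual first checks that both arguments are list
        # instances, so a UserList fails even with identical contents.
        with self.assertRaises(AssertionError):
            self.assertListEqual(wrapped, ["a", "b"])

    def test_sequence_equal_accepts_userlist(self):
        wrapped = UserList(["a", "b"])
        # assertSequenceEqual compares contents without a type check
        # unless seq_type is passed explicitly.
        self.assertSequenceEqual(wrapped, ["a", "b"])
```

Saved to a file, this can be run with `python -m unittest <file>`.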

bnewm0609 added 2 commits August 16, 2023 15:47

Add first stab at Layer implementation

6bf13f3

Change tests to not enforce list type

2d9bcb1

bnewm0609 linked an issue Aug 16, 2023 that may be closed by this pull request

Layer definition with Slice compatibility #24

Open