Support Unicode signs when parsing usage info
In some terms (notably Japanese ones), full-width signs are used instead
of the regular '<' and '>' to delimit usage info.
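
The intended behaviour can be tried out in isolation with a minimal sketch like the one below. It assumes only the pattern introduced by this commit; the constant `USAGE_INFO_RX` and the standalone `extract_usage_info` helper are hypothetical names for illustration and do not reproduce the parser's real `remove_from_string` flow.

```ruby
# Hypothetical standalone sketch, not the parser's actual code path.
USAGE_INFO_RX = %r{
  # regular ASCII less-than and greater-than signs
  < (?<inner>.*?) >
  |
  # full-width signs (U+FF1C and U+FF1E) seen in some CJK terms
  \uFF1C (?<inner>.*?) \uFF1E
}x.freeze

# Returns the stripped text between a matching pair of signs,
# or nil when the string contains no such pair.
def extract_usage_info(str)
  md = USAGE_INFO_RX.match(str)
  md && md[:inner].strip
end

p extract_usage_info("a string <info>")            # => "info"
p extract_usage_info("a string \uFF1Cinfo\uFF1E")  # => "info"
p extract_usage_info("a string \uFF1Cinfo>")       # => nil
```

Both alternatives reuse the group name `inner`, so `md[:inner]` returns whichever branch matched. Because each branch pairs an opening sign with its own closing sign, a full-width opener followed by an ASCII closer matches neither branch.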
skalee committed Mar 1, 2021
1 parent a39b24b commit 4eb05ec
Showing 2 changed files with 18 additions and 2 deletions.
11 changes: 9 additions & 2 deletions lib/iev/termbase/term_attrs_parser.rb
@@ -92,10 +92,17 @@ def extract_part_of_speech(str)
 end

 def extract_usage_info(str)
-  info_rx = /<(.*?)>/
+  info_rx = %r{
+    # regular ASCII less and greater than signs
+    < (?<inner>.*?) >
+    |
+    # ＜ and ＞, i.e. full-width less and greater than signs
+    # which are used instead of ASCII signs in some CJK terms
+    \uFF1C (?<inner>.*?) \uFF1E
+  }x.freeze

   remove_from_string(str, info_rx) do |md|
-    @usage_info = md[1].strip
+    @usage_info = md[:inner].strip
   end
 end

9 changes: 9 additions & 0 deletions spec/iev/termbase/term_attrs_parser_spec.rb
@@ -157,6 +157,15 @@
       string: "a whatever" do
     expect(subject.usage_info).to be(nil)
   end
+
+  it "supports full-width signs", string: "a string \uFF1Cinfo\uFF1E" do
+    expect(subject.usage_info).to eq("info")
+  end
+
+  it "disallows mixing regular and full-width signs",
+    string: "a string \uFF1Cinfo>" do
+    expect(subject.usage_info).to be(nil)
+  end
 end

 describe "geographical area" do