icu tokenizer may panic on invalid UTF-8 #34

Open
mschoch opened this issue May 17, 2017 · 4 comments

Comments

@mschoch
Contributor

mschoch commented May 17, 2017

When the ICU tokenizer gets invalid UTF-8 input like:

"something\x96something"

You may get a panic. This seems to depend on the version of ICU you have installed, and may also depend on some default ICU settings and/or environment variables.

Some users have reported that adding the following fixes the issue for them.

// #include "unicode/ucnv.h"
func init() {
    C.ucnv_setDefaultName(C.CString("UTF-8"))
}

This issue has been moved from the bleve repo: blevesearch/bleve#185
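
As a complement to the ICU-side fix above, an application could also check its input before it reaches the tokenizer. The following is only a minimal sketch, assuming that replacing invalid bytes with U+FFFD is acceptable for your data; `sanitizeForTokenizer` is a hypothetical helper, not part of bleve or blevex, and it has not been verified to avoid the panic on every ICU version.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeForTokenizer is a hypothetical helper (not part of blevex).
// It returns s unchanged when it is already valid UTF-8; otherwise it
// replaces each run of invalid bytes with the replacement character U+FFFD.
func sanitizeForTokenizer(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	in := "something\x96something" // the invalid input from this issue
	out := sanitizeForTokenizer(in)

	// Report validity and byte length before and after sanitizing.
	fmt.Println("input valid:", utf8.ValidString(in), "bytes:", len(in))
	fmt.Println("output valid:", utf8.ValidString(out), "bytes:", len(out))
}
```

Note that the sanitized string is longer than the original (U+FFFD takes three bytes in UTF-8), so any byte offsets produced by the tokenizer refer to the sanitized text, not to the raw input.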

@joeblew99

joeblew99 commented Jun 4, 2017

Possibly a related error?

go get github.com/blevesearch/blevex/icu
# github.com/blevesearch/blevex/icu
../../../blevesearch/blevex/icu/boundary.go:15:11: fatal error: 'unicode/utypes.h' file not found
 #include "unicode/utypes.h"
          ^
1 error generated.

Is it because there is a dependency I need to install?

@atthakorn

Running go test in github.com/blevesearch/blevex/lang/th, I see this panic on my system too:

panic: runtime error: slice bounds out of range

goroutine 21 [running]:
github.com/blevesearch/blevex/icu.(*UnicodeWordBoundaryTokenizer).Tokenize(0xc4200b4178, 0xc4205dc000, 0x31a, 0x31a, 0x0, 0x0, 0x0)
        /var/www/go/src/github.com/blevesearch/blevex/icu/boundary.go:103 +0x67b
github.com/blevesearch/bleve/analysis.(*Analyzer).Analyze(0xc4200b8780, 0xc4205dc000, 0x31a, 0x31a, 0x31a, 0x31a, 0x7cdf9b7326f47234)
        /var/www/go/src/github.com/blevesearch/bleve/analysis/type.go:86 +0xcc
github.com/blevesearch/bleve/document.(*TextField).Analyze(0xc42052f920, 0xf, 0xc4205628fe)
        /var/www/go/src/github.com/blevesearch/bleve/document/field_text.go:72 +0x86
github.com/blevesearch/bleve/index/upsidedown.(*UpsideDownCouch).Analyze.func1(0x9f6e20, 0xc42052f920, 0x1)
        /var/www/go/src/github.com/blevesearch/bleve/index/upsidedown/analysis.go:48 +0x35b
github.com/blevesearch/bleve/index/upsidedown.(*UpsideDownCouch).Analyze(0xc4201c2300, 0xc42051ca80, 0xc420562f38)
        /var/www/go/src/github.com/blevesearch/bleve/index/upsidedown/analysis.go:70 +0x414
github.com/blevesearch/bleve/index.AnalysisWorker(0xc42008e120, 0xc42008e180)
        /var/www/go/src/github.com/blevesearch/bleve/index/analysis.go:106 +0x55
created by github.com/blevesearch/bleve/index.NewAnalysisQueue
        /var/www/go/src/github.com/blevesearch/bleve/index/analysis.go:94 +0xcd

I confirmed the issue is fixed by adding these lines to blevex/icu/boundary.go:

// #include "unicode/ucnv.h"
func init() {
    C.ucnv_setDefaultName(C.CString("UTF-8"))
}

@mschoch Do you have any plans to merge this patch upstream? It would be really nice, thank you.
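
The stack trace above shows the panic being raised inside an analysis worker goroutine (created by NewAnalysisQueue), so a recover() placed around the indexing call in application code would not catch it. One defensive option, sketched here under assumptions, is to recover at the point where the tokenizer is called. `tokenizeFunc`, `safeTokenize`, and `brokenTokenizer` are hypothetical names for illustration; the real Tokenize method has a different signature, and this does not fix the underlying ICU behavior.

```go
package main

import "fmt"

// tokenizeFunc is a hypothetical stand-in for a tokenizer call; the real
// blevex Tokenize method has a different signature and return type.
type tokenizeFunc func(input []byte) []string

// safeTokenize calls fn and converts any panic it raises into an error,
// so one bad document does not take down the whole process.
func safeTokenize(fn tokenizeFunc, input []byte) (tokens []string, err error) {
	defer func() {
		if r := recover(); r != nil {
			tokens = nil
			err = fmt.Errorf("tokenizer panicked: %v", r)
		}
	}()
	return fn(input), nil
}

// brokenTokenizer simulates the failure mode from the trace by slicing
// past the end of the input text.
func brokenTokenizer(input []byte) []string {
	s := string(input)
	end := len(s) + 2 // deliberately out of range
	return []string{s[:end]}
}

func main() {
	_, err := safeTokenize(brokenTokenizer, []byte("something\x96something"))
	fmt.Println("error:", err)
}
```

Turning the panic into an error lets the caller skip or log the offending document instead of crashing.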

atthakorn added a commit to atthakorn/search-engine that referenced this issue Aug 15, 2018
@steveyen
Contributor

Thanks @atthakorn. For anybody else running into this who needs a temporary workaround, I wonder whether those lines of init() code could also just be invoked from any app code.

@atthakorn

atthakorn commented Aug 16, 2018

@steveyen

Wow, I tried it: the following lines can be invoked from app code and it works fine. Many thanks (I'm new to Go).

// #cgo LDFLAGS: -licuuc -licudata
// #include "unicode/ucnv.h"
import "C"

func init() {
	C.ucnv_setDefaultName(C.CString("UTF-8"))
}

However, to leave a trail for others: blevesearch/blevex does not seem to support vendoring. At least, I tried dep and it failed to meet constraints:

$ dep ensure -add github.com/blevesearch/blevex

Solving failure: No versions of github.com/blevesearch/blevex met constraints:

To install blevesearch/blevex as an extension, the workaround can be either:

  1. Copy blevex locally as an internal package, e.g. internal/blevex/icu. This option is minimal, as we can grab only the desired extensions; in my case I'm using the blevex/icu and blevex/lang/th modules.

  2. Make blevesearch/blevex a git submodule.

Wherever the blevex modules live (local copy or submodule), we can put this workaround patch into a separate file, e.g. blevex-icu-patch, in any package in the app layer.

That way we don't pollute the core extension and the code stays clean.
