Japanese linguistics
in Apache Lucene™ and Apache Solr™

             May 9th, 2012

             Christian Moen
          christian@atilika.com
About me
•   MSc. in computer science, University of Oslo, Norway
•   Worked with search at FAST (now Microsoft) for 10 years
     •   5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
     •   5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
•   Founded アティリカ株式会社 in 2009
     •   We help companies innovate using search technologies and good ideas
     •   We know information retrieval, natural language processing and big data
     •   We are based in Tokyo, but we have clients everywhere
•   Newbie Lucene & Solr Committer
     •   Mostly been working on Japanese language support (Kuromoji) so far
•   Please write to me at christian@atilika.com or cm@apache.org
Today’s topics

•   Japanese 101 - ordering beer and toasting


•   Japanese language processing


•   Japanese features in Lucene/Solr
Japanese 101
ビールください
 bi-ru kudasai

A beer, please
ありがとうございます!
 arigatō gozaimasu!

Thank you very much!
乾杯!
kanpai!

Cheers!
JR新宿駅の近くにビールを飲みに行こうか?
JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?

  Shall we go for a beer near JR Shinjuku station?
JR新宿駅の近くにビールを飲みに行こうか?

Romaji - ローマ字
・Latin characters (26+)
・Used for proper nouns, etc.

Katakana - カタカナ
・Phonetic script (~50)
・Typically used for loan words

Kanji - 漢字
・Chinese characters (50,000+)
・Used for stems & proper nouns

Hiragana - ひらがな
・Phonetic script (~50)
・Used for inflections & particles
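
As a rough illustration of the four scripts, the following Python sketch labels each character by Unicode block and groups consecutive characters of the same script. The block ranges are a simplification (only the main CJK Unified Ideographs block is treated as kanji), and the function names are made up for this example.

```python
from itertools import groupby


def script_of(ch):
    """Classify one character by script using simplified Unicode ranges."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # main CJK Unified Ideographs block only
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "romaji"
    return "other"


def script_runs(text):
    """Group consecutive characters that share a script."""
    return [(script, "".join(chars)) for script, chars in groupby(text, key=script_of)]


for script, run in script_runs("JR新宿駅の近くにビールを飲みに行こうか?"):
    print(script, run)
```

Script boundaries are not word boundaries: 近く (near) is split into a kanji run and a hiragana run, so script changes alone cannot segment the sentence.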
JR新宿駅の近くにビールを飲みに行こうか?
? What are the words in this sentence?
! Words are implicit in Japanese - there
  is no white space that separates them
JR新宿駅の近くにビールを飲みに行こうか?
? How do we index this for search, then?
! We need to segment text into tokens first
! Two major approaches for segmentation

          1. n-gramming
          2. morphological analysis
            (statistical approach)
n-gramming (n=2)
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?

n=2: JR R新 新宿 宿駅 駅の の近 近く ...
Problems with n-gramming
    JR   R新   新宿   宿駅   駅の   の近   近く   ...
    ●    ×     ●     ×     ×     ×     ●

    宿駅: change of semantics! It means ‘post town’, ‘relay station’ or ‘stage’

•   Does not preserve meaning well and often changes semantics
     •   Hurts ranking and search precision (many false positives)
•   Also generates many terms per document or query
     •   Increases index size and slows down search
•   Still sometimes appropriate for certain search applications
     •   Compliance, e-commerce with special product names, ...
Morphological analysis
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?

JR 新宿 駅 の 近く に ビール を 飲み に 行こ う か ?
●  ●   ●  ●  ●   ●  ●     ●  ●   ●  ●   ●  ●  ●
  •   Tokens reflect what a Japanese speaker considers to be words
  •   Machine-learned statistical approach
       •   Conditional Random Fields (CRFs) decoded using Viterbi
       •   Also provides part-of-speech tags, readings for kanji, etc.
  •   Several statistical models available with high accuracy (F > 0.97)
       •   Models/dictionaries are available as IPADIC, UniDic, ...
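
To give a feel for the decoding step, here is a toy Python sketch that finds the lowest-cost path through a word lattice by dynamic programming, which is the core idea behind Viterbi decoding. It is heavily simplified: the costs are invented for this example, only per-word costs are used (real models also score the connection between adjacent tokens), and the function name and the unknown-word handling are assumptions, not Kuromoji's implementation.

```python
def viterbi_segment(text, word_costs, unknown_cost=10):
    """Return the lowest-cost segmentation of text.

    word_costs maps known words to a cost (lower is better). A character
    not covered by any known word becomes a one-character token that
    costs unknown_cost.
    """
    n = len(text)
    best = [float("inf")] * (n + 1)  # best[i]: cheapest cost to reach offset i
    back = [0] * (n + 1)             # back[i]: start offset of the last token
    best[0] = 0
    max_len = max(map(len, word_costs), default=1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_costs:
                cost = word_costs[word]
            elif end - start == 1:
                cost = unknown_cost
            else:
                continue  # multi-character span that is not a known word
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Follow the back pointers from the end of the text to recover the path.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Invented costs: 宿駅 is a real word, but it is more expensive than 新宿 + 駅.
toy_costs = {"新宿": 2, "駅": 2, "宿駅": 6, "新": 4, "の": 1, "近く": 2, "近": 4, "く": 4}
print(viterbi_segment("新宿駅の近く", toy_costs))
# ['新宿', '駅', 'の', '近く']
```

With these costs, the path 新宿 / 駅 / の / 近く has a total cost of 7, cheaper than any path that goes through 宿駅.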
How does this actually work?
Demo
Japanese support in
  Lucene and Solr
Japanese in Lucene/Solr
! New feature in Lucene/Solr 3.6

! Available out-of-the-box

! Easy to use with reasonable defaults

! Provides sophisticated Japanese linguistics

! Customisable
How do we use it?

      ! Use JapaneseAnalyzer



      ! Use field type “text_ja”
        in example schema.xml
Demo
Feature summary / text_ja analyzer chain

JapaneseTokenizer
    Segments Japanese text into tokens with very high accuracy
    •   Token attributes for part-of-speech, base form, readings, etc.
    •   Compound segmentation with compound synonyms
    •   Segmentation is customisable using user dictionaries

JapaneseBaseFormFilter
    Adjective and verb lemmatisation (by reduction)

JapanesePartOfSpeechStopFilter
    Stop-word removal based on part-of-speech tags
    See example/solr/conf/lang/stoptags_ja.txt

CJKWidthFilter
    Character width normalisation (fast Unicode NFKC subset)

StopFilter
    Stop-word removal
    See example/solr/conf/lang/stopwords_ja.txt

JapaneseKatakanaStemFilter
    Normalises common katakana spelling variations

LowerCaseFilter
    Lowercases tokens
Feature details
Compound nouns
? How do we deal with compound nouns?
       Japanese                  English
    関西国際空港              Kansai International Airport
シニアソフトウェアエンジニア           Senior Software Engineer


! These are one word in Japanese, so
  searching for 空港 (airport) doesn’t match

! We need to segment the compounds, too
Compound segmentation

関西国際空港 (Kansai International Airport)
    関西 (Kansai)   国際 (International)   空港 (Airport)

シニアソフトウェアエンジニア (Senior Software Engineer)
    シニア (Senior)   ソフトウェア (Software)   エンジニア (Engineer)

 ! We are using a heuristic to implement this
Compound synonym tokens
    Position 1        Position 2        Position 3
    関西              国際              空港
    関西国際空港

•   Segment compounds into their parts
    •   Good for recall - we can also search and match 空港 (airport)
•   We keep the compound itself as a synonym
    •   Good for precision with an exact hit because of IDF
•   Approach benefits both precision and recall for overall good ranking
    •   JapaneseTokenizer actually returns a graph of tokens
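
One way to picture the token graph is as terms that each carry a position increment and a position length. The Python sketch below uses hand-written tuples that mirror the table; the representation and function names are assumptions for illustration and do not call any Lucene API.

```python
from collections import defaultdict


def absolute_positions(tokens):
    """Convert (term, position_increment, position_length) triples into
    (term, start_position, end_position) triples."""
    result = []
    position = 0
    for term, increment, length in tokens:
        position += increment
        result.append((term, position, position + length))
    return result


def build_index(tokens):
    """Map each term to the list of positions where it starts."""
    index = defaultdict(list)
    for term, start, _end in absolute_positions(tokens):
        index[term].append(start)
    return dict(index)


# The compound shares position 1 with its first part and spans three positions.
tokens = [("関西", 1, 1), ("関西国際空港", 0, 3), ("国際", 1, 1), ("空港", 1, 1)]
index = build_index(tokens)
print(index["空港"])          # the part is searchable on its own
print(index["関西国際空港"])  # the full compound is searchable as an exact term
```

Because both the parts and the full compound are indexed, a query for 空港 matches (recall) and a query for 関西国際空港 matches a rarer term (precision).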
Character width normalisation
? How do we deal with character widths?
    Half-width・半角        Full-width・全角
    Lucene                  Ｌｕｃｅｎｅ
    ｶﾀｶﾅ                    カタカナ
    123                     １２３

! Use CJKWidthFilter to normalise them
  (Unicode NFKC subset)

    Input text        Ｌｕｃｅｎｅ      ｶﾀｶﾅ          １２３
    CJKWidthFilter    Lucene          カタカナ      123
                      half-width      full-width    half-width
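
The effect of this normalisation can be reproduced with Python's standard library, which implements full Unicode NFKC. Note the hedge: CJKWidthFilter is described as a fast subset of NFKC, so full NFKC also changes characters that the filter would leave alone; this is only an approximation of the filter's behaviour.

```python
import unicodedata


def normalise_width(text):
    """Fold full-width Latin letters and digits to ASCII and half-width
    katakana to full-width katakana, using full NFKC normalisation."""
    return unicodedata.normalize("NFKC", text)


print(normalise_width("Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３"))
# Lucene カタカナ 123
```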
Katakana end-vowel stemming
  ? A common spelling variation in katakana
    is the long-vowel sound at the end of a word
       English     Japanese spelling variations
       manager     マネージャー     マネージャ     マネジャー

   ! We use JapaneseKatakanaStemFilter to
     normalise/stem the end vowel of long terms

                 Input text    コピー    マネージャー    マネージャ    マネジャー
 JapaneseKatakanaStemFilter    コピー    マネージャ      マネージャ    マネジャ
                               copy      manager         manager       “manager”
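
The rule in the table can be sketched in a few lines of Python: drop a trailing prolonged sound mark (ー) from katakana terms that are long enough. The minimum length of 4 is an assumption chosen so that the short term コピー is left untouched, as in the table; the function names are hypothetical and this is not the filter's actual source.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー


def is_katakana(term):
    """True if every character is in the Unicode Katakana block."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term, min_length=4):
    """Remove one trailing ー from katakana terms of at least min_length characters."""
    if len(term) >= min_length and is_katakana(term) and term.endswith(PROLONGED_SOUND_MARK):
        return term[:-1]
    return term


for term in ["コピー", "マネージャー", "マネージャ", "マネジャー"]:
    print(term, "->", stem_katakana(term))
```

Note that マネジャー stems to マネジャ, which still differs from マネージャ, so this rule only unifies variants that differ in the final long vowel.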
Lemmatisation
? Japanese adjectives and verbs are highly
  inflected, how do we deal with that?

    Dictionary form: 買う (kau, to buy)

    Inflected forms (not exhaustive)
    買いなさい          買いませんでしたら    買える              買わせられる
    買いなさるな        買いませんでしたり    買おう              買わせる
    買いましたら        買いませんなら        買った              買わない
    買いましたり        買うだろう            買ったら            買わないだろう
    買いまして          買うでしょう          買ったり            買わないで
    買いましょう        買うな                買って              買わないでしょう
    買います            買うまい              買わせない          買わなかった
    買いますまい        買え                  買わせます          買わなかったら
    買いませば          買えない              買わせません        買わなかったり
    買いません          買えば                買わせられない      買わなければ
    買いませんで        買えます              買わせられます      買われない
    買いませんでした    買えません            買わせられません    買われます

 ! Use JapaneseBaseFormFilter to normalise
   inflected adjectives and verbs to dictionary form
   (lemmatisation by reduction)
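
Conceptually, lemmatisation by reduction means the tokenizer's dictionary already knows each token's base form, and the filter simply swaps the surface form for it. The Python sketch below assumes that model; the `Token` type and the hand-written example tokens are illustrative, not output from Kuromoji.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    surface: str               # the text as it appears in the input
    base_form: Optional[str]   # dictionary form, or None if the token is not inflected


def base_form_filter(tokens: List[Token]) -> List[str]:
    """Replace each token's surface form with its base form when one is known."""
    return [token.base_form or token.surface for token in tokens]


# Hand-written tokens for ビールを買った ("bought a beer"); assumed segmentation.
tokens = [
    Token("ビール", None),
    Token("を", None),
    Token("買っ", "買う"),
    Token("た", None),
]
print(base_form_filter(tokens))
# ['ビール', 'を', '買う', 'た']
```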
User dictionaries
•   Custom dictionaries can be used for ad hoc
    segmentation, i.e. to override the default model
•   File format is simple and there’s no need to
    assign weights, etc. before using them
•   Example custom dictionary:
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
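
To make the four-column format concrete, here is a small Python sketch that reads entries like the ones above. The field names (`surface`, `segments`, `readings`, `pos`) are labels chosen for this example, and the parser assumes no commas inside fields; it is not the loader Lucene uses.

```python
def parse_user_dictionary(text):
    """Parse user dictionary lines of the form
    surface,space-separated segments,space-separated readings,part-of-speech.
    Blank lines and lines starting with # are skipped."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        surface, segments, readings, pos = line.split(",")
        entries.append({
            "surface": surface,
            "segments": segments.split(),
            "readings": readings.split(),
            "pos": pos,
        })
    return entries


sample = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""

for entry in parse_user_dictionary(sample):
    print(entry["surface"], "->", entry["segments"], entry["readings"], entry["pos"])
```

Each segment lines up with one reading, so the first entry splits 関西国際空港 into three tokens while the second keeps 朝青龍 as a single token.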
Japanese focus in 4.0
•   Improvements in JapaneseTokenizer
     •   Improved search mode for katakana compounds
     •   Improved unknown word segmentation
     •   Some performance improvements
•   CharFilters for various character normalisations
     •   Dates and numbers
     •   Repetition marks (odoriji)
•   Japanese spell-checker
     •   Robert and Koji almost got this into 3.6, but it was
         postponed because it required API changes
Acknowledgements
Robert Muir
Thanks for the heavy lifting of integrating Kuromoji into Lucene,
for always reviewing my patches quickly, and for the friendly help
Michael McCandless
Thanks for streaming Viterbi and synonym compounds!
Uwe Schindler
Thanks for performance improvements + being the policeman
Simon Willnauer
Thanks for doing the Kuromoji code donation process so well
Gaute Lambertsen & Gerry Hocks
Thanks for presentation feedback and being great colleagues
Q&A
ありがとうございました!
 arigatō gozaimashita!

Thank you very much!

Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Plus de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Dernier

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 

Dernier (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 

Japanese Linguistics in Lucene and Solr

  • 1. Japanese linguistics in Apache Lucene™ and Apache Solr™ May 9th, 2012 Christian Moen christian@atilika.com
  • 2. About me • MSc. in computer science, University of Oslo, Norway • Worked with search at FAST (now Microsoft) for 10 years • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway • 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan • Founded アティリカ株式会社 in 2009 • We help companies innovate using search technologies and good ideas • We know information retrieval, natural language processing and big data • We are based in Tokyo, but we have clients everywhere • Newbie Lucene & Solr Committer • Mostly been working on Japanese language support (Kuromoji) so far • Please write me on christian@atilika.com or cm@apache.org
  • 4. Today’s topics • Japanese 101 - ordering beer and toasting • Japanese language processing • Japanese features in Lucene/Solr
  • 15. JR新宿駅の近くにビールを飲みに行こうか? JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka? Shall we go for a beer near JR Shinjuku station?
  • 21. JR新宿駅の近くにビールを飲みに行こうか? Romaji - ローマ字 ・Latin characters (26+) ・Used for proper nouns, etc. Katakana - カタカナ ・Phonetic script (~50) ・Typically used for loan words Kanji - 漢字 ・Chinese characters (50,000+) ・Used for stems & proper nouns Hiragana - ひらがな ・Phonetic script (~50) ・Used for inflections & particles
  • 24. JR新宿駅の近くにビールを飲みに行こうか? ? What are the words in this sentence? ! Words are implicit in Japanese - there is no white space that separates them
  • 26. JR新宿駅の近くにビールを飲みに行こうか? ? How do we index this for search, then? ! We need to segment text into tokens first
  • 27. ! Two major approaches for segmentation 1. n-gramming 2. morphological analysis (statistical approach)
  • 35. n-gramming (n=2) JR新宿駅の近くにビールを飲みに行こうか? Shall we go for a beer near JR Shinjuku station? n=2: JR R新 新宿 宿駅 駅の の近 近く
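
The sliding window shown on the n-gramming slides can be sketched in a few lines of Python. This is only an illustration of character n-gramming, not the tokenizer Lucene ships; the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n=2):
    """Return every overlapping character n-gram of `text`, in order."""
    # A window of width n slides one character at a time, so a string
    # of length L yields L - n + 1 n-grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# First part of the example sentence used on the slides.
tokens = char_ngrams("JR新宿駅の近く", n=2)
print(tokens)
```

Every character except the first and last ends up in two bigrams, which is why the number of terms grows quickly with text length.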
  • 47. Problems with n-gramming JR (●) R新 (×) 新宿 (●) 宿駅 (×, change of semantics! means ‘post town’, ‘relay station’ or ‘stage’) 駅の (×) の近 (×) 近く (●) ... • Does not preserve meaning well and often changes semantics • Impacts on ranking - search precision (many false positives) • Also generates many terms per document or query • Impacts on index size and performance • Still sometimes appropriate for certain search applications • Compliance, e-commerce with special product names, ...
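
A minimal Python sketch of the false-positive problem described above. It assumes a toy "index" that is just a set of terms, with the word segmentation copied by hand from the morphological analysis slide; the helper names `bigrams` and `matches` are hypothetical.

```python
def bigrams(text):
    """Set of overlapping character bigrams in `text`."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def matches(query_terms, doc_terms):
    """A document matches when it contains every query term."""
    return set(query_terms) <= set(doc_terms)


doc = "JR新宿駅の近く"
doc_bigrams = bigrams(doc)
# Hand-written word segmentation, taken from the morphological analysis slide.
doc_words = {"JR", "新宿", "駅", "の", "近く"}

# 宿駅 ('post town') never occurs as a word in the document, but it is
# one of the document's bigrams, so the bigram index reports a hit.
query = "宿駅"
bigram_hit = matches(bigrams(query), doc_bigrams)
word_hit = matches({query}, doc_words)
print(bigram_hit, word_hit, len(doc_bigrams), len(doc_words))
```

The same sketch also shows the term-count cost: the bigram set is larger than the word set even for this short fragment.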
  • 53. Morphological analysis JR新宿駅の近くにビールを飲みに行こうか? Shall we go for a beer near JR Shinjuku station? JR 新宿 駅 の 近く に ビール を 飲み に 行こ う か ? ● ● ● ● ● ● ● ● ● ● ● ● ● ● • Tokens reflect what a Japanese speaker considers to be words • Machine-learned statistical approach • Conditional Random Fields (CRFs) decoded using Viterbi • Also does part-of-speech tagging, readings for kanji, etc. • Several statistical models available with high accuracy (F > 0.97) • Models/dictionaries are available as IPADIC, UniDic, ...
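
The Viterbi decoding mentioned above can be illustrated with a heavily simplified Python sketch. It assumes a tiny hand-made dictionary with made-up word costs and ignores the connection costs between adjacent tokens that a real CRF-trained model also uses; `WORD_COSTS` and `segment` are hypothetical names and none of the numbers come from IPADIC or UniDic.

```python
# Hypothetical dictionary: surface form -> cost (lower is more likely).
WORD_COSTS = {
    "JR": 2, "新宿": 2, "新": 4, "宿": 4, "宿駅": 5,
    "駅": 2, "の": 1, "近く": 2, "近": 4, "く": 3,
}


def segment(text, word_costs):
    """Lowest-cost segmentation of `text` using dynamic programming."""
    n = len(text)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: minimum cost of segmenting text[:i]
    back = [0] * (n + 1)     # back[i]: start offset of the last word in that path
    best[0] = 0
    max_len = max(len(word) for word in word_costs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_costs and best[start] + word_costs[word] < best[end]:
                best[end] = best[start] + word_costs[word]
                back[end] = start
    if best[n] == inf:
        raise ValueError("text cannot be segmented with this dictionary")
    # Follow the back pointers from the end of the text to recover the path.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]


print(segment("JR新宿駅の近く", WORD_COSTS))
```

With these costs the path through 新宿 + 駅 is cheaper than the one through 新 + 宿駅, so the misleading 宿駅 token is not produced.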
  • 54. How does this actually work?
  • 55. Demo
  • 56. Japanese support in Lucene and Solr
  • 62. Japanese in Lucene/Solr ! New feature in Lucene/Solr 3.6 ! Available out-of-the-box ! Easy to use with reasonable defaults ! Provides sophisticated Japanese linguistics ! Customisable
  • 65. How do we use it? ! Use JapaneseAnalyzer ! Use field type “text_ja” in example schema.xml
  • 66. Demo
  • 73. Feature summary / text_ja analyzer chain
      • JapaneseTokenizer: Segments Japanese text into tokens with very high accuracy
          • Token attributes for part-of-speech, base form, readings, etc.
          • Compound segmentation with compound synonyms
          • Segmentation is customisable using user dictionaries
      • JapaneseBaseFormFilter: Adjective and verb lemmatisation (by reduction)
      • JapanesePartOfSpeechStopFilter: Stop-words removal based on part-of-speech tags. See example/solr/conf/lang/stoptags_ja.txt
      • CJKWidthFilter: Character width normalisation (fast Unicode NFKC subset)
      • StopFilter: Stop-words removal. See example/solr/conf/lang/stopwords_ja.txt
      • JapaneseKatakanaStemFilter: Normalises common katakana spelling variations
      • LowerCaseFilter: Lowercases
  • 78. Compound nouns ? How do we deal with compound nouns? Japanese: 関西国際空港, English: Kansai International Airport. Japanese: シニアソフトウェアエンジニア, English: Senior Software Engineer. ! These are one word in Japanese, so searching for 空港 (airport) doesn’t match ! We need to segment the compounds, too
  • 82. Compound segmentation 関西国際空港 (Kansai International Airport) → 関西 (Kansai) 国際 (International) 空港 (Airport) シニアソフトウェアエンジニア (Senior Software Engineer) → シニア (Senior) ソフトウェア (Software) エンジニア (Engineer) ! We are using a heuristic to implement this
  • 86. Compound synonym tokens
      Position 1: 関西 / Position 2: 国際 / Position 3: 空港
      Compound synonym: 関西国際空港
      • Segment the compounds into their parts
          • Good for recall - we can also search and match 空港 (airport)
      • We keep the compound itself as a synonym
          • Good for precision with an exact hit because of IDF
      • Approach benefits both precision and recall for overall good ranking
      • JapaneseTokenizer actually returns a graph of tokens
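A minimal sketch of the compound-synonym idea in plain Python, not Lucene code. The `Token` type, the `position_length` field and both helper functions are hypothetical names chosen for illustration; the only thing taken from the slide is that the parts occupy positions 1 to 3 and the full compound is kept alongside them.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    position: int         # 1-based start position, as on the slide
    position_length: int  # how many positions the token spans (assumption)


def compound_tokens(compound: str, parts: List[str]) -> List[Token]:
    """Emit the whole compound as a synonym token plus one token per part."""
    tokens = [Token(compound, 1, len(parts))]
    tokens += [Token(part, i, 1) for i, part in enumerate(parts, start=1)]
    return tokens


def matches(tokens: List[Token], query_term: str) -> bool:
    """A single-term query matches if any indexed token has that term."""
    return any(token.term == query_term for token in tokens)
```

With `tokens = compound_tokens("関西国際空港", ["関西", "国際", "空港"])`, both `matches(tokens, "空港")` and `matches(tokens, "関西国際空港")` hold: the part gives recall, and the rarer full compound gives the exact hit that IDF rewards.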
  • 88. Character width normalisation
      ? How do we deal with character widths?
      Half-width・半角: Lucene ｶﾀｶﾅ 123
      Full-width・全角: Ｌｕｃｅｎｅ カタカナ １２３
      ! Use CJKWidthFilter to normalise them (Unicode NFKC subset)
      Input text: Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３
      CJKWidthFilter output: Lucene (half-width) カタカナ (full-width) 123 (half-width)
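The effect of width normalisation can be approximated with Python's standard library. This is a sketch only: it applies full Unicode NFKC, whereas the slide says CJKWidthFilter implements a fast subset of it, so full NFKC also changes characters the filter would leave alone. The function name `normalise_width` is made up for this example.

```python
import unicodedata


def normalise_width(text: str) -> str:
    """Fold full-width Latin letters and digits to half-width, and
    half-width katakana to full-width, using full NFKC as a stand-in."""
    return unicodedata.normalize("NFKC", text)
```

For example, `normalise_width("Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３")` returns `"Lucene カタカナ 123"`.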
  • 90. Katakana end-vowel stemming
      ? A common spelling variation in katakana is a long-vowel sound at the end of a word
      English: manager / Japanese spelling variations: マネージャー マネージャ マネジャー
      ! We use JapaneseKatakanaStemFilter to normalise/stem the end vowel for long terms
      Input text: コピー (copy) マネージャー (manager) マネージャ (manager) マネジャー (“manager”)
      JapaneseKatakanaStemFilter output: コピー マネージャ マネージャ マネジャ
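The stemming rule on this slide can be sketched in a few lines of Python. This is an illustration, not the Lucene implementation: the minimum length of 4 characters is an assumption about what "long terms" means (it is consistent with コピー being left untouched in the example), and the function names are invented here.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー, the katakana long-vowel mark
ASSUMED_MINIMUM_LENGTH = 4       # assumption: shorter terms are not stemmed


def is_katakana(term: str) -> bool:
    """True if every character lies in the Unicode katakana block."""
    return bool(term) and all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term: str, minimum_length: int = ASSUMED_MINIMUM_LENGTH) -> str:
    """Drop one trailing long-vowel mark from sufficiently long katakana terms."""
    if (
        len(term) >= minimum_length
        and term.endswith(PROLONGED_SOUND_MARK)
        and is_katakana(term)
    ):
        return term[:-1]
    return term
```

Applied to the slide's input, `[stem_katakana(t) for t in ["コピー", "マネージャー", "マネージャ", "マネジャー"]]` gives `["コピー", "マネージャ", "マネージャ", "マネジャ"]`. Note that マネジャ still differs from マネージャ, which is why the slide puts the last “manager” in quotes.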
  • 94. Lemmatisation
      ? Japanese adjectives and verbs are highly inflected, how do we deal with that?
      Dictionary form: 買う (kau, to buy)
      Inflected forms (not exhaustive): 買いなさい 買いませんでしたら 買える 買わせられる 買いなさるな 買いましたら 買いませんでしたり 買いませんなら 買おう 買った 買わせる 買わない 買いましたり 買うだろう 買ったら 買わないだろう 買いまして 買いましょう 買うでしょう 買うな 買ったり 買って 買わないで 買わないでしょう 買わせない 買います 買うまい 買わなかった 買いますまい 買え 買わせます 買わなかったら 買いませば 買えない 買わせません 買わなかったり 買いません 買えば 買わせられない 買わなければ 買いませんで 買えます 買わせられます 買われない 買いませんでした 買えません 買わせられません 買われます
      ! Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction)
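A rough sketch of lemmatisation by reduction, under the assumption that the tokenizer has already attached a base form to each inflected token (the feature summary lists base form as a token attribute). The `Token` type and the hand-written tokens are illustrative only; how a real analysis segments 買った may differ.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    surface: str
    base_form: Optional[str]  # None when no base form is available


def base_form_filter(tokens: List[Token]) -> List[str]:
    """Replace each term with its base form when one is available."""
    return [
        token.base_form if token.base_form is not None else token.surface
        for token in tokens
    ]


# Hand-written example tokens (assumption): the stem of 買った reduces to 買う.
EXAMPLE_TOKENS = [Token("買っ", "買う"), Token("た", None)]
```

Here `base_form_filter(EXAMPLE_TOKENS)` yields `["買う", "た"]`, so a query for 買う can match text that contains 買った.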
  • 95. User dictionaries
      • Custom dictionaries can be used for ad hoc segmentation, i.e. to override the default model
      • File format is simple and there’s no need to assign weights, etc. before using them
      • Example custom dictionary:
          # Custom segmentation and POS entry for long entries
          関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
          # Custom reading and POS for former sumo wrestler Asashoryu
          朝青龍,朝青龍,アサショウリュウ,カスタム人名
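To make the file format concrete, here is a small Python sketch that reads entries in the four-column layout shown on this slide: surface form, space-separated segmentation, space-separated readings, part-of-speech tag. The parser and its field names are hypothetical and only mirror the example; it is not how Lucene loads the file.

```python
from typing import Dict, List, NamedTuple


class UserEntry(NamedTuple):
    segments: List[str]
    readings: List[str]
    pos: str


def parse_user_dictionary(text: str) -> Dict[str, UserEntry]:
    """Parse 'surface,segmentation,readings,pos' lines, skipping comments."""
    entries: Dict[str, UserEntry] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        surface, segmentation, readings, pos = line.split(",")
        entries[surface] = UserEntry(segmentation.split(" "), readings.split(" "), pos)
    return entries


EXAMPLE = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""
```

Parsing `EXAMPLE` gives two entries. The first splits 関西国際空港 into three segments with one reading each, and the second keeps 朝青龍 as a single segment with a custom reading.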
  • 96. Japanese focus in 4.0
      • Improvements in JapaneseTokenizer
          • Improved search mode for katakana compounds
          • Improved unknown word segmentation
          • Some performance improvements
      • CharFilters for various character normalisations
          • Dates and numbers
          • Repetition marks (odoriji)
      • Japanese spell-checker
          • Robert and Koji almost got this into 3.6, but it was postponed because it required API changes
  • 97. Acknowledgements
      Robert Muir: Thanks for the heavy lifting integrating Kuromoji into Lucene, and for always reviewing my patches quickly and being friendly and helpful
      Michael McCandless: Thanks for streaming Viterbi and synonym compounds!
      Uwe Schindler: Thanks for performance improvements + being the policeman
      Simon Willnauer: Thanks for doing the Kuromoji code donation process so well
      Gaute Lambertsen & Gerry Hocks: Thanks for presentation feedback and being great colleagues
  • 98. Q&A