{"id":6025,"date":"2022-08-18T15:07:43","date_gmt":"2022-08-18T06:07:43","guid":{"rendered":"https:\/\/www.skyer9.pe.kr\/wordpress\/?p=6025"},"modified":"2022-08-19T14:57:36","modified_gmt":"2022-08-19T05:57:36","slug":"elasticsearch-text-analysis","status":"publish","type":"post","link":"https:\/\/www.skyer9.pe.kr\/wordpress\/?p=6025","title":{"rendered":"Elasticsearch &#8211; Text Analysis"},"content":{"rendered":"<h1>Elasticsearch &#8211; Text Analysis<\/h1>\n<h2>Three Stages<\/h2>\n<p>An Elasticsearch analyzer consists of<br \/>\nzero or more Character Filters,<br \/>\nexactly one Tokenizer,<br \/>\nand zero or more Token Filters.<\/p>\n<h2>Character Filter<\/h2>\n<p>When a value of the text data type is analyzed, the Character Filters (zero or more) are applied first.<\/p>\n<p>This stage replaces or removes specific characters in the text.<\/p>\n<h3>Character Filter Types<\/h3>\n<ul>\n<li>\n<p>HTML Strip Character Filter<\/p>\n<p>Strips HTML elements and decodes HTML entities.<\/p>\n<p>GET \/_analyze<\/p>\n<pre><code class=\"language-json\">{\n &quot;tokenizer&quot;: &quot;keyword&quot;,\n &quot;char_filter&quot;: [\n   &quot;html_strip&quot;\n ],\n &quot;text&quot;: &quot;&lt;p&gt;I&amp;apos;m so &lt;b&gt;happy&lt;\/b&gt;!&lt;\/p&gt;&quot;\n}<\/code><\/pre>\n<p>Response<\/p>\n<pre><code class=\"language-json\">{\n &quot;tokens&quot;: [\n   {\n     &quot;token&quot;: &quot; I&#039;m so happy! 
&quot;,\n     &quot;start_offset&quot;: 0,\n     &quot;end_offset&quot;: 32,\n     &quot;type&quot;: &quot;word&quot;,\n     &quot;position&quot;: 0\n   }\n ]\n}<\/code><\/pre>\n<\/li>\n<li>\n<p>Mapping Character Filter<\/p>\n<p>Replaces each occurrence of a given character or string with another character or string.<\/p>\n<ul>\n<li><code>mappings<\/code> : takes the mappings inline as an array of <code>key =&gt; value<\/code> strings.<\/li>\n<li><code>mappings_path<\/code> : reads the mappings from a file instead.<\/li>\n<\/ul>\n<p>GET \/_analyze<\/p>\n<pre><code class=\"language-json\">{\n &quot;tokenizer&quot;: &quot;keyword&quot;,\n &quot;char_filter&quot;: [\n   {\n     &quot;type&quot;: &quot;mapping&quot;,\n     &quot;mappings&quot;: [\n       &quot;c++ =&gt; c_plus_plus&quot;\n     ]\n   }\n ],\n &quot;text&quot;: &quot;c++&quot;\n}<\/code><\/pre>\n<p>Response<\/p>\n<pre><code class=\"language-json\">{\n &quot;tokens&quot;: [\n   {\n     &quot;token&quot;: &quot;c_plus_plus&quot;,\n     &quot;start_offset&quot;: 0,\n     &quot;end_offset&quot;: 3,\n     &quot;type&quot;: &quot;word&quot;,\n     &quot;position&quot;: 0\n   }\n ]\n}<\/code><\/pre>\n<\/li>\n<li>\n<p>Pattern Replace Character Filter<\/p>\n<p>Replaces characters that match a Java regular expression.<\/p>\n<\/li>\n<\/ul>\n<h2>Tokenizer<\/h2>\n<p>After the Character Filters, exactly one Tokenizer<br \/>\nsplits the text into N tokens (words).<\/p>\n<p>For example, the whitespace Tokenizer extracts tokens by splitting on whitespace characters.<\/p>\n<h3>Tokenizer Types<\/h3>\n<h4>Word-Oriented<\/h4>\n<ul>\n<li>\n<p>Standard Tokenizer<\/p>\n<p>Similar to the whitespace Tokenizer, 
but it also drops most punctuation such as periods, question marks, and hyphens.<\/p>\n<p>POST _analyze<\/p>\n<pre><code class=\"language-json\">{\n &quot;tokenizer&quot;: &quot;standard&quot;,\n &quot;text&quot;: &quot;The 2 QUICK Brown-Foxes jumped over the lazy dog&#039;s bone.&quot;\n}<\/code><\/pre>\n<\/li>\n<li>\n<p>Whitespace Tokenizer<\/p>\n<p>Extracts tokens by splitting on whitespace characters.<br \/>\nIt does not remove periods, question marks, or other punctuation.<\/p>\n<\/li>\n<li>\n<p>Classic Tokenizer<\/p>\n<p>A grammar-based Tokenizer specialized for English.<br \/>\nIt keeps email addresses and internet hostnames as single tokens, and does not split hyphenated tokens that contain a number.<\/p>\n<\/li>\n<\/ul>\n<h4>Partial Word<\/h4>\n<ul>\n<li>\n<p>N-Gram Tokenizer<\/p>\n<p>Splits words into fragments using a sliding window.<br \/>\nA sliding window moves to the right one character at a time,<br \/>\ncutting out a fragment of a fixed length at each step.<\/p>\n<p>It is well suited to tokenizing Korean text written without spaces.<\/p>\n<p><code>quick \u2192 [qu, ui, ic, ck]<\/code> (with <code>min_gram<\/code> and <code>max_gram<\/code> both set to 2)<\/p>\n<p><code>token_chars<\/code> may include <code>letter<\/code>, <code>digit<\/code>, <code>whitespace<\/code>, <code>punctuation<\/code>, <code>symbol<\/code>, and <code>custom<\/code>.<br \/>\nTo use <code>custom<\/code>, <code>custom_token_chars<\/code> must also be set.<\/p>\n<p>PUT my_index<\/p>\n<pre><code class=\"language-json\">{\n 
&quot;settings&quot;: {\n   &quot;max_ngram_diff&quot;: 20,\n   &quot;analysis&quot;: {\n     &quot;analyzer&quot;: {\n       &quot;my_analyzer&quot;: {\n         &quot;tokenizer&quot;: &quot;my_tokenizer&quot;\n       }\n     },\n     &quot;tokenizer&quot;: {\n       &quot;my_tokenizer&quot;: {\n         &quot;type&quot;: &quot;ngram&quot;,\n         &quot;custom_token_chars&quot;: &quot;+-_&#039;&quot;,\n         &quot;min_gram&quot;: 1,\n         &quot;max_gram&quot;: 20,\n         &quot;token_chars&quot;: [\n           &quot;letter&quot;,\n           &quot;digit&quot;,\n           &quot;custom&quot;\n         ]\n       }\n     }\n   }\n }\n}<\/code><\/pre>\n<p>POST my_index\/_analyze<\/p>\n<pre><code class=\"language-json\">{\n &quot;analyzer&quot;: &quot;my_analyzer&quot;,\n &quot;text&quot;: &quot;1800\uc0ac\ubb34\uc6a9\ucc45\uc0c1&quot;\n}<\/code><\/pre>\n<\/li>\n<li>\n<p>Edge N-Gram Tokenizer<\/p>\n<p>Starting from the first character, it emits progressively longer prefixes of the word.<\/p>\n<p><code>quick \u2192 [q, qu, qui, quic, quick]<\/code> (with <code>min_gram<\/code> 1 and <code>max_gram<\/code> 5)<\/p>\n<\/li>\n<\/ul>\n<h4>Structured Text<\/h4>\n<p>&#8230;&#8230;<\/p>\n<h2>Token Filter<\/h2>\n<p>The tokens produced by the Tokenizer are then refined by the Token Filters.<\/p>\n<p>For example, lowercase converts every token to lowercase.<\/p>\n<h3>Token Filter Types<\/h3>\n<p><a href=\"https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/current\/analysis-tokenfilters.html\">Reference<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Elasticsearch &#8211; Text Analysis Three Stages An Elasticsearch analyzer consists of zero or more Character Filters, exactly one Tokenizer, and zero or more Token Filters. 
Character Filter When a value of the text data type is analyzed, the Character Filters (zero or more) are applied first. This stage replaces or removes specific characters in the text. Character Filter Types HTML\u2026 <span class=\"read-more\"><a href=\"https:\/\/www.skyer9.pe.kr\/wordpress\/?p=6025\">Read More &raquo;<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-6025","post","type-post","status-publish","format-standard","hentry","category-elasticsearch"],"_links":{"self":[{"href":"https:\/\/www.skyer9.pe.kr\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/6025","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.skyer9.pe.kr\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.skyer9.pe.kr\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.skyer9.pe.kr\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.skyer9.pe.kr\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6025"}],"version-history":[{"count":14,"href":"https:\/\/www.skyer9.pe.kr\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/6025\/revisions"}],"predecessor-version":[{"id":6051,"href":"https:\/\/www.skyer9.pe.kr\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/6025\/revisions\/6051"}],"wp:attachment":[{"href":"https:\/\/www.skyer9.pe.kr\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6025"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.skyer9.pe.kr\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6025"},{"taxonomy"
:"post_tag","embeddable":true,"href":"https:\/\/www.skyer9.pe.kr\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6025"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}