Conference paper Open Access
Kulekci, M. Oguzhan; Habib, Ismail; Aghabaiglou, Amir
<?xml version='1.0' encoding='UTF-8'?> <record xmlns="http://www.loc.gov/MARC21/slim"> <leader>00000nam##2200000uu#4500</leader> <datafield tag="245" ind1=" " ind2=" "> <subfield code="a">Privacy-Preserving Text Similarity via Non-Prefix-Free Codes</subfield> </datafield> <datafield tag="024" ind1=" " ind2=" "> <subfield code="a">10.1007/978-3-030-32047-8_9</subfield> <subfield code="2">doi</subfield> </datafield> <controlfield tag="001">73665</controlfield> <datafield tag="980" ind1=" " ind2=" "> <subfield code="a">user-tubitak-destekli-proje-yayinlari</subfield> </datafield> <datafield tag="520" ind1=" " ind2=" "> <subfield code="a">Many methods have been proposed to compute the similarity score alpha &lt;- S(A, B) in between two plain documents A and B. However, when their contents are confidential, special processing is required to protect privacy. A great extent of the solutions offered to date is mostly based on homomorphic encryption or secure multi-party computation techniques, where their computational cost inhibits the practical usage, especially on massive sets. In this study we propose an alternative by encoding the documents with non-prefix-free (NPF) coding before applying the preferred similarity metric S(). The NPF coding simply represents the symbols with variable-length codewords, where the codeword set is generated without the prefix-free restriction. Thus, a codeword may be a prefix of another, and without the explicit codeword boundary information, retrieving the original data from the encoded stream becomes hard due to the lack of unique decodability in non-prefix-free codes. We provide the combinatorial analysis of this hardness, and experimentally compare the similarity scores obtained on NPF encoded documents and on original plain text versions. We have considered normalized compression distance (NCD) and Jaccard coefficient (JC) for the similarity metric S(). 
When A' and B' denote the NPF-encoded documents, experiments conducted on METER corpus revealed that the difference between alpha' &lt;- S(A', B') and alpha &lt;- S(A, B) lies in the range of 0.5% and 3% for both NCD and JC.</subfield> </datafield> <datafield tag="650" ind1="1" ind2="7"> <subfield code="2">opendefinition.org</subfield> <subfield code="a">cc-by</subfield> </datafield> <datafield tag="700" ind1=" " ind2=" "> <subfield code="u">Istanbul Tech Univ, Inst Informat, Istanbul, Turkey</subfield> <subfield code="a">Habib, Ismail</subfield> </datafield> <datafield tag="700" ind1=" " ind2=" "> <subfield code="u">Istanbul Tech Univ, Inst Informat, Istanbul, Turkey</subfield> <subfield code="a">Aghabaiglou, Amir</subfield> </datafield> <datafield tag="980" ind1=" " ind2=" "> <subfield code="b">conferencepaper</subfield> <subfield code="a">publication</subfield> </datafield> <datafield tag="542" ind1=" " ind2=" "> <subfield code="l">open</subfield> </datafield> <datafield tag="100" ind1=" " ind2=" "> <subfield code="u">Istanbul Tech Univ, Inst Informat, Istanbul, Turkey</subfield> <subfield code="a">Kulekci, M. Oguzhan</subfield> </datafield> <datafield tag="711" ind1=" " ind2=" "> <subfield code="a">SIMILARITY SEARCH AND APPLICATIONS (SISAP 2019)</subfield> </datafield> <datafield tag="260" ind1=" " ind2=" "> <subfield code="c">2019-01-01</subfield> </datafield> <controlfield tag="005">20210316040049.0</controlfield> <datafield tag="909" ind1="C" ind2="O"> <subfield code="o">oai:zenodo.org:73665</subfield> <subfield code="p">user-tubitak-destekli-proje-yayinlari</subfield> </datafield> <datafield tag="856" ind1="4" ind2=" "> <subfield code="z">md5:0d4dbdf24b76fcea197842f4f5d1f6bb</subfield> <subfield code="s">158</subfield> <subfield code="u">https://aperta.ulakbim.gov.trrecord/73665/files/bib-df864d4e-953d-427a-bbdf-6decfac55346.txt</subfield> </datafield> <datafield tag="540" ind1=" " ind2=" "> <subfield code="u">http://www.opendefinition.org/licenses/cc-by</subfield> <subfield code="a">Creative Commons Attribution</subfield> </datafield> </record>
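The abstract in the record above describes encoding documents with non-prefix-free (NPF) codewords and then applying a similarity metric such as normalized compression distance (NCD) or the Jaccard coefficient (JC). The following is a minimal sketch of that idea, not the authors' implementation. Several choices here are assumptions that the record does not state: codewords are assigned by frequency rank (`0`, `1`, `00`, `01`, ...), zlib stands in for the compressor in NCD, and JC is taken over sets of byte trigrams. The function names are hypothetical.

```python
import zlib
from collections import Counter


def npf_codebook(text: str) -> dict:
    """Assign variable-length binary codewords with no prefix-free restriction.

    Assumption: the i-th most frequent symbol (0-indexed) gets the binary
    form of i + 2 with its leading '1' removed: '0', '1', '00', '01', ...
    """
    ranked = [sym for sym, _ in Counter(text).most_common()]
    return {sym: bin(i + 2)[3:] for i, sym in enumerate(ranked)}


def npf_encode(text: str, codebook: dict) -> str:
    """Concatenate codewords; codeword boundaries are deliberately dropped."""
    return "".join(codebook[sym] for sym in text)


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance, with zlib as an assumed compressor."""
    ca = len(zlib.compress(a))
    cb = len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)


def jaccard(a: bytes, b: bytes, n: int = 3) -> float:
    """Jaccard coefficient over sets of byte n-grams (n = 3 is an assumption)."""
    sa = {a[i:i + n] for i in range(len(a) - n + 1)}
    sb = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog " * 5
    doc_b = "the quick brown cat jumps over the lazy dog " * 5

    # A shared codebook is assumed so both documents are encoded consistently.
    book = npf_codebook(doc_a + doc_b)
    enc_a = npf_encode(doc_a, book).encode("ascii")
    enc_b = npf_encode(doc_b, book).encode("ascii")

    print("NCD plain  :", ncd(doc_a.encode(), doc_b.encode()))
    print("NCD encoded:", ncd(enc_a, enc_b))
    print("JC plain   :", jaccard(doc_a.encode(), doc_b.encode()))
    print("JC encoded :", jaccard(enc_a, enc_b))
```

Because `0` is a prefix of `00`, the encoded stream `000` could be read as three `0` codewords, as `0` followed by `00`, or as `00` followed by `0`, which illustrates the lack of unique decodability the abstract refers to. The scores this sketch prints are not expected to reproduce the 0.5% to 3% differences reported on the METER corpus.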
Views | 48 |
Downloads | 5 |
Data volume | 790 Bytes |
Unique views | 48 |
Unique downloads | 5 |