LANGUAGE BASED WEB CRAWLING ON BIG DATA

Girgin, Canan; Gonultas, Hayati; Pembe Muhtaroglu, F. Canan; Demir, Seniz; Akin, Ahmet A.; Obali, Murat

doi:10.81043/aperta.99293

Published January 1, 2014 | Version v1

Conference paper Open


1. TUBITAK BILGEM, Bilisim Teknol Enstitusu, Ankara, Turkey

Online textual and visual data created and used by web users have been increasing dramatically and continually. This growth has created a need for easy and fast access to online data and has driven the development of alternative means of accessing it. Web crawlers are now the most efficient and popular tools for accessing the large volumes of data available on the web. This paper describes a web crawler that runs on a distributed Hadoop cluster and crawls web pages whose content is in a predefined language. A language identification tool was developed so that the system can focus on a single language. The accuracy of the language identification tool is evaluated on a small data set of 4729 web pages, and the performance of the focused web crawling system is reported on a big data set of 86 million web pages containing Turkish content.
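The idea of a language-focused crawl can be illustrated with a minimal single-process sketch. It is not the system described in the paper: the Hadoop distribution is omitted, and the stopword-based `identify_language`, the `STOPWORDS` table, the `focused_crawl` function, and the caller-supplied `fetch` callback are all hypothetical names chosen for illustration. A real language identifier would more likely rely on a trained statistical model.

```python
from collections import deque

# Hypothetical, tiny stopword lists used only for this illustration.
STOPWORDS = {
    "tr": {"ve", "bir", "bu", "için", "ile", "de", "da", "çok", "olarak", "gibi"},
    "en": {"the", "and", "of", "to", "in", "is", "that", "for", "with", "as"},
}


def identify_language(text):
    """Return the language whose stopwords occur most often, or None if none occur."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def focused_crawl(seeds, fetch, target="tr", limit=100):
    """Breadth-first crawl that keeps, and follows links from, target-language pages only.

    `fetch(url)` is an assumed caller-supplied function returning (text, links).
    """
    frontier = deque(seeds)
    seen = set(seeds)
    kept = []
    while frontier and len(kept) < limit:
        url = frontier.popleft()
        text, links = fetch(url)
        if identify_language(text) != target:
            continue  # off-language page: discard it and do not expand its links
        kept.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return kept
```

Pruning the frontier at off-language pages is what keeps the crawl focused: links are followed only from pages already identified as being in the target language, so pages reachable solely through off-language pages are never fetched.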

