[6144] Research and Implementation of a Machine-Learning-Based News Web Page Identification Method — Foreign Literature Translation

 2021-12-06 09:12

The World Wide Web (WWW) is a collection of billions of documents formatted in HTML. Web search engines are used to find desired information on the Web: whenever a user submits a query, the search is performed over the engine's database. Since no search engine's repository can accommodate every page available on the web, only the most relevant pages should be stored, which calls for a better approach to selecting pages from the Web. The software that traverses the web to fetch pages is called a "crawler" or "spider". A specialized crawler, the focused crawler, traverses the web and selects pages relevant to a defined topic rather than exploring all regions of the web; it does not collect all web pages, but retrieves only the relevant ones. The central problem is therefore how to retrieve relevant, high-quality web pages. To address this problem, in this paper we design and implement an algorithm that partitions web pages into content blocks on the basis of headings and then computes the relevancy of each partitioned block; the relevancy of a page is then the sum of the relevancy scores of all its blocks. The algorithm also computes a URL score and identifies whether a URL is relevant to the topic. Headings yield an appropriate division of pages into blocks, because a complete block comprises only the heading, content, images, links, tables, and sub-tables of that particular block.

I. INTRODUCTION

The World Wide Web (or the Web) is a collection of billions of interlinked documents formatted using HTML. The WWW is a network from which we can obtain a large amount of information. On the Web, a user views pages that contain text, images, and other multimedia, and navigates between them using hyperlinks. By "search engine" we usually mean the actual search performed through a database of HTML documents: when you ask a search engine for information, it searches through the index it has created rather than through the Web itself. Different search engines give different ranking results because not every engine uses the same algorithm to search its indices. The question is: what is going on behind these search engines, and why is it possible to get relevant results so fast?

The answer is web crawlers. Crawlers form a crucial component of a search engine; their primary job is to traverse the web and retrieve web pages to populate the database for later indexing and ranking. More specifically, the crawler iteratively performs the following process:

1. Download the Web page.

2. Parse through the downloaded page and retrieve all the links.

3. For each link retrieved, repeat the process.
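The three-step loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: an in-memory `PAGES` dictionary stands in for real HTTP downloads and HTML parsing, and all names are hypothetical.

```python
from collections import deque

# Toy "web": URL -> (page text, outgoing links). In a real crawler the
# download step would be an HTTP fetch and the parse step an HTML parser.
PAGES = {
    "a": ("page a", ["b", "c"]),
    "b": ("page b", ["c"]),
    "c": ("page c", []),
}

def crawl(seed):
    """Repeat the three steps: download a page, parse out its links, follow them."""
    frontier = deque([seed])
    visited = {}
    while frontier:
        url = frontier.popleft()
        if url in visited or url not in PAGES:
            continue
        text, links = PAGES[url]       # step 1: "download" the page
        visited[url] = text            # store the content for later indexing
        for link in links:             # step 2: retrieve all the links
            if link not in visited:
                frontier.append(link)  # step 3: repeat the process per link
    return visited
```

Starting from seed `"a"`, the loop eventually visits every reachable page, which is exactly the exhaustive behavior that focused crawling (below) tries to avoid.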

This information can be used to collect more related data by intelligently and efficiently choosing which links to follow and which pages to discard. This process is called focused crawling [1]. A focused crawler tries to identify the most promising links and ignores off-topic documents. If the crawler starts from a document that is i steps away from a target document, it downloads only a small subset of all the documents that are up to i-1 steps from the starting document. If the search strategy is optimal, the crawler takes only i steps to discover the target. To achieve high-quality topic specialization, the logic behind focused crawling tries to imitate human behavior when searching for a specific topic. The crawler takes into account the following features of the web that can be used for topic discrimination:

• the relevance of the parent page to the subject

• the importance of a page

• the structure of the web

• features of the link and the text around a link

• the experience of the crawler
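A minimal sketch of how the first and fourth of these cues might drive link selection. The keyword-overlap relevance function, the topic set, and the page graph below are all hypothetical; the paper's actual block- and URL-scoring method is not reproduced here. Each link's priority combines the relevance of its parent page with that of the text around the link (its anchor text), and the frontier is a priority queue rather than a FIFO queue.

```python
import heapq

TOPIC = {"machine", "learning", "news"}  # illustrative topic keywords

def relevance(text):
    """Hypothetical relevance score: fraction of topic terms present in the text."""
    words = set(text.lower().split())
    return len(words & TOPIC) / len(TOPIC)

# Toy web: URL -> (page text, [(linked URL, anchor text), ...]).
PAGES = {
    "seed":   ("news about machine learning", [("sports", "sports scores"),
                                               ("ml", "machine learning news")]),
    "sports": ("sports results today", []),
    "ml":     ("deep machine learning for news", []),
}

def focused_crawl(seed, threshold=0.3):
    frontier = [(-1.0, seed)]  # max-heap via negated priorities
    visited, collected = set(), []
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in visited or url not in PAGES:
            continue
        visited.add(url)
        text, links = PAGES[url]
        page_rel = relevance(text)
        if page_rel >= threshold:
            collected.append(url)  # keep on-topic pages, discard the rest
        for link_url, anchor in links:
            # Combine the parent page's relevance with the anchor text's,
            # mirroring the "parent page" and "text around a link" cues.
            prio = 0.5 * page_rel + 0.5 * relevance(anchor)
            heapq.heappush(frontier, (-prio, link_url))
    return collected
```

On this toy graph the crawler visits the high-priority "ml" link before "sports" and keeps only the on-topic pages.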

The remainder of the paper is structured as follows. The next section surveys related work. Section 3 describes the focused-crawler methodology we used. Section 4 describes pattern recognition and the algorithms we used. Sections 4 and 5 present the implementation and experimental results. Section 6 gives conclusions and future work.

II. LITERATURE SURVEY

The first generation of crawlers [22], on which most web search engines are based, relies heavily on traditional graph algorithms, such as breadth-first or depth-first traversal, to index the web. A core set of URLs is used as a seed set, and the algorithm recursively follows hyperlinks down to other documents. Document content is paid little heed, since the ultimate goal of the crawl is to cover the whole web. However, at the time, the web was two to three orders of magnitude smaller than it is today, so those systems did not address the scaling problems inherent in a crawl of today's web. Depth-first crawling [22] follows each possible path to its conclusion before another path is tried. It works by finding the first link on the first page, crawling the page associated with that link, finding the first link on the new page, and so on, until the end of the path has been reached; the process continues until all branches of all links have been exhausted. Breadth-first crawling [2] checks each link on a page before proceeding to the next page: it crawls each link on the first page, then each link on the first page's first link, and so on, until each level of links has been exhausted. In Fish-Search [3], the Web is crawled by a team of crawlers viewed as a school of fish. If a "fish" finds a relevant page based on keywords specified in the query, it continues looking by following more links from that page; if the page is not relevant, its child links receive a low preferential value. Shark-Search [4] is a modification of Fish-Search which differs
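The contrast between the two traversals comes down to the frontier data structure: taking links from a queue yields breadth-first order, while taking them from a stack yields depth-first order. A sketch on a toy link graph (all URLs illustrative); note that a stack pops the most recently pushed link first, so it explores a page's last link before its first, whereas the textbook description above follows the first link first. Either way, one whole path is exhausted before another is tried.

```python
from collections import deque

# Toy link graph: URL -> list of out-links.
GRAPH = {
    "root": ["a", "b"],
    "a":    ["a1", "a2"],
    "b":    ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def crawl_order(seed, depth_first=False):
    """Return the visit order: popleft() (queue) = BFS, pop() (stack) = DFS."""
    frontier = deque([seed])
    order, seen = [], {seed}
    while frontier:
        url = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        for link in GRAPH.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

Breadth-first visits level by level (root, then a and b, then their children), while depth-first runs one branch to its end before backtracking.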
