Web mining of firm websites: A framework for web scraping and a pilot study for Germany
Jan Kinne and
No 18-033, ZEW Discussion Papers from ZEW - Leibniz Centre for European Economic Research
Nowadays, almost all (relevant) firms have their own websites, which they use to publish information about their products and services. Using the example of innovation in firms, we outline a framework for extracting information from firm websites using web scraping and data mining. For this purpose, we present an easy-to-use, free web scraping tool for large-scale data retrieval from firm websites. We apply this tool in a large-scale pilot study to provide information on the data source (i.e. the population of firm websites in Germany), whose qualitative and quantitative properties have not yet been studied rigorously. We find, inter alia, that the use of websites and the websites' characteristics (number of subpages and hyperlinks, text volume, language used) differ according to firm size, age, location, and sector. Web-based studies also have to contend with distinct outliers and with the fact that low broadband availability appears to prevent firms from operating a website. Finally, we propose two approaches, based on neural network language models and social network analysis, to derive firm-level information from the extracted web data.
Keywords: Web Mining; Web Scraping; R&D; R&I; STI; Innovation; Indicators; Text Mining
JEL-codes: O30 C81 C88
New Economics Papers: this item is included in nep-big, nep-cmp and nep-ict
Persistent link: https://EconPapers.repec.org/RePEc:zbw:zewdip:18033
Bibliographic data for series maintained by ZBW - Leibniz Information Centre for Economics.