Text characteristics of English language university Web sites

Mike Thelwall

Journal of the American Society for Information Science and Technology, 2005, vol. 56, issue 6, 609-619

Abstract: The nature of the contents of academic Web sites is of direct relevance to the new field of scientific Web intelligence, and to search engine and topic‐specific crawler designers. We analyze word frequencies in national academic Webs using the Web sites of three English‐speaking nations: Australia, New Zealand, and the United Kingdom. Strong regularities were found in page size and word frequency distributions, but with significant anomalies. At least 26% of pages contain no words. High‐frequency words include university names and acronyms, Internet terminology, and computing product names, which are not always words in common usage away from the Web. A minority of low‐frequency words are spelling mistakes; other common types include nonwords, proper names, foreign language terms, and computer science variable names. Based upon these findings, recommendations for data cleansing and filtering are made, particularly for clustering applications.

Date: 2005

Downloads: (external link)
https://doi.org/10.1002/asi.20126

Persistent link: https://EconPapers.repec.org/RePEc:bla:jamist:v:56:y:2005:i:6:p:609-619

Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1532-2890

More articles in Journal of the American Society for Information Science and Technology from Association for Information Science & Technology
Bibliographic data for series maintained by Wiley Content Delivery.

Page updated 2025-03-19
Handle: RePEc:bla:jamist:v:56:y:2005:i:6:p:609-619