Answer a question

I'd like to count the frequency of a list of words in a specific website. The code however doesn't return the exact number of words that a manual "control F" command would. What am I doing wrong?

Here's my code:

import pandas as pd
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import re

url='https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
fr=[] 
wanted = ['tender','2020','date']    
for word in wanted:
    a=requests.get(url).text.count(word)
    dic={'phrase':word,
          'frequency':a,              
            }          
    fr.append(dic)  
    print('Frequency of',word, 'is:',a)
data=pd.DataFrame(fr)    

Answers

Refer to the comments in your question to see why using requests might be a bad idea to count the frequency of a word in the "visible spectrum" of a webpage (what you actually see in the browser).

If you want to go about this with selenium, you could try:

from selenium import webdriver

url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'

driver = webdriver.Chrome(chromedriver_location)
driver.get(url)
body = driver.find_element_by_tag_name('body')

fr = [] 
wanted = ['tender', '2020', 'date']    
for word in wanted:
    freq = body.text.lower().count(word) # .lower() to account for count's case sensitive behaviour
    dic = {'phrase': word, 'frequency': freq}          
    fr.append(dic)  
    print('Frequency of', word, 'is:', freq)

which gave me the same results that a CTRL + F does.

You can test BeautifulSoup too (which you're importing by the way) by modifying your code a little bit:

import requests
from bs4 import BeautifulSoup

url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
fr = [] 
wanted = ['tender','2020','date']    
a = requests.get(url).text
soup = BeautifulSoup(a, 'html.parser')
for word in wanted:
    freq = soup.get_text().lower().count(word)
    dic = {'phrase': word, 'frequency': freq}          
    fr.append(dic)  
    print('Frequency of', word, 'is:', freq)

That gave me the same results, except for the word tender, which according to BeautifulSoup appears 12 times, and not 11. Test them out for yourself and see what suits you.

Logo

学AI,认准AI Studio!GPU算力,限时免费领,邀请好友解锁更多惊喜福利 >>>

更多推荐