Answer a question

I am trying to remove \n and \t that show up in data scraped from a webpage.

I have used the strip() function, however it doesn't seem to work for some reason. My output still shows up with all the \ns and \ts.

Here's my code :

import urllib.request
from bs4 import BeautifulSoup
import sys

all_comments = [] 
max_comments = 10
base_url = 'https://www.mygov.in/'
next_page = base_url + '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/'

while next_page and len(all_comments) < max_comments : 
    response = response = urllib.request.urlopen(next_page)
    srcode = response.read()
    soup = BeautifulSoup(srcode, "html.parser")

    all_comments_div=soup.find_all('div', class_="comment_body"); 
    for div in all_comments_div:
        data = div.find('p').text
        data = data.strip(' \t\n')#actual comment content
        data=''.join([ i for i in data if ord(i) < 128 ])
        all_comments.append(data)

    #getting the link of the stream for more comments
    next_page = soup.find('li', class_='pager-next first last')
    if next_page : 
        next_page = base_url + next_page.find('a').get('href')
    print('comments: {}'.format(len(all_comments)))

print(all_comments)

And here's the output I'm getting:

comments: 10

["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']

Answers

Use replace instead of strip:

div = "/n blablabla /t blablabla"
div = div.replace('/n', '')
div = div.replace('/t','')
print(div)

Result:

blablabla  blablabla

Some explanation: Strip doesn't work in your case because it only removes specified characters from beginning or end of a string, you can't remove it from the middle.

Some examples:

div = "/nblablabla blablabla"
div = div.strip('/n')
#div = div.replace('/t','')
print(div)

Result:

blablabla blablabla

Character anywhere between start and end will not be removed:

div = "blablabla /n blablabla"
div = div.strip('/n')
#div = div.replace('/t','')
print(div)

result:

blablabla /n blablabla
Logo

学AI,认准AI Studio!GPU算力,限时免费领,邀请好友解锁更多惊喜福利 >>>

更多推荐