Answer a question

I created a dataframe that has the following schema:

In [43]: yelp_df.printSchema()
root
 |-- business_id: string (nullable = true)
 |-- cool: integer (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: integer (nullable = true)
 |-- id: string (nullable = true)
 |-- stars: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- type: string (nullable = true)
 |-- useful: integer (nullable = true)
 |-- user_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- full_address: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- neighborhoods: string (nullable = true)
 |-- open: boolean (nullable = true)
 |-- review_count: integer (nullable = true)
 |-- state: string (nullable = true)

I want to select only the records where the "open" column is true. The following command, which I run in PySpark, returns nothing:

yelp_df.filter(yelp_df["open"] == "true").collect()

Answers

You're comparing mismatched data types. The schema lists open as a boolean column, not a string, so yelp_df["open"] == "true" is incorrect: "true" is a string.

Instead, compare against the Boolean value:

yelp_df.filter(yelp_df["open"] == True).collect()

This correctly compares the values of open against the Boolean primitive True, rather than the non-Boolean string "true".
