Pandas Cookbook, Chapter 08: Data Cleaning
Published: 2019-06-28



Chapter 08: Data Cleaning
Chapter 09: Combining Pandas Objects
Chapter 10: Time Series Analysis
Chapter 11: Visualization with Matplotlib, Pandas, and Seaborn


In[1]: import pandas as pd
       import numpy as np

1. Tidying variable values stored as column names with stack

# Load the state_fruit dataset
In[2]: state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
       state_fruit
out[2]: (output shown as an image in the original post)
# The stack method moves all the column names into a vertical, single-level row index
In[3]: state_fruit.stack()
out[3]:
Texas    Apple      12
         Orange     10
         Banana     40
Arizona  Apple       9
         Orange      7
         Banana     12
Florida  Apple       0
         Orange     14
         Banana    190
dtype: int64
# Use reset_index() to turn the result into a DataFrame
In[4]: state_fruit_tidy = state_fruit.stack().reset_index()
       state_fruit_tidy
out[4]: (output shown as an image in the original post)
# Rename the columns
In[5]: state_fruit_tidy.columns = ['state', 'fruit', 'weight']
       state_fruit_tidy
out[5]: (output shown as an image in the original post)
# Alternatively, use rename_axis to name the row index levels
In[6]: state_fruit.stack()\
               .rename_axis(['state', 'fruit'])
out[6]:
state    fruit
Texas    Apple      12
         Orange     10
         Banana     40
Arizona  Apple       9
         Orange      7
         Banana     12
Florida  Apple       0
         Orange     14
         Banana    190
dtype: int64
# Chain reset_index onto that as well
In[7]: state_fruit.stack()\
               .rename_axis(['state', 'fruit'])\
               .reset_index(name='weight')
out[7]: (output shown as an image in the original post)
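Since the tabular outputs above survive only as images, here is a minimal, self-contained sketch of the same stack/rename_axis/reset_index pattern; the inline DataFrame below is a stand-in for data/state_fruit.csv:

import pandas as pd

# Stand-in data: states in the row index, fruits as columns
state_fruit = pd.DataFrame({'Apple': [12, 9, 0],
                            'Orange': [10, 7, 14],
                            'Banana': [40, 12, 190]},
                           index=['Texas', 'Arizona', 'Florida'])

# stack pivots the column names into the innermost row-index level,
# rename_axis names both levels, reset_index flattens to a tidy table
tidy = (state_fruit.stack()
                   .rename_axis(['state', 'fruit'])
                   .reset_index(name='weight'))
# tidy columns: state, fruit, weight -- one row per (state, fruit) pair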

There's more...

# Read the state_fruit2 dataset
In[8]: state_fruit2 = pd.read_csv('data/state_fruit2.csv')
       state_fruit2
out[8]: (output shown as an image in the original post)
# The state names are not in the row index here, so stack turns every column, including State, into one long Series
In[9]: state_fruit2.stack()
out[9]:
0  State       Texas
   Apple          12
   Orange         10
   Banana         40
1  State     Arizona
   Apple           9
   Orange          7
   Banana         12
2  State     Florida
   Apple           0
   Orange         14
   Banana        190
dtype: object
# Set State as the row index first, then stack, to get a result similar to In[3]
In[10]: state_fruit2.set_index('State').stack()
out[10]:
State
Texas    Apple      12
         Orange     10
         Banana     40
Arizona  Apple       9
         Orange      7
         Banana     12
Florida  Apple       0
         Orange     14
         Banana    190
dtype: int64

2. Tidying variable values stored as column names with melt

# Read the state_fruit2 dataset
In[11]: state_fruit2 = pd.read_csv('data/state_fruit2.csv')
        state_fruit2
out[11]: (output shown as an image in the original post)
# Use the melt method, passing columns to id_vars and value_vars. melt turns the original column names into a variable column and the original values into a value column
In[12]: state_fruit2.melt(id_vars=['State'],
                          value_vars=['Apple', 'Orange', 'Banana'])
out[12]: (output shown as an image in the original post)
# Set an arbitrary row index
In[13]: state_fruit2.index = list('abc')
        state_fruit2.index.name = 'letter'
In[14]: state_fruit2
out[14]: (output shown as an image in the original post)
# var_name and value_name rename the newly created variable and value columns
In[15]: state_fruit2.melt(id_vars=['State'],
                          value_vars=['Apple', 'Orange', 'Banana'],
                          var_name='Fruit',
                          value_name='Weight')
out[15]: (output shown as an image in the original post)
# To put all the values in one column and the old column labels in another, call melt with no arguments
In[16]: state_fruit2.melt()
out[16]: (output shown as an image in the original post)
# To mark the identifying variables, only the id_vars parameter is needed
In[17]: state_fruit2.melt(id_vars='State')
out[17]: (output shown as an image in the original post)
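A minimal, self-contained sketch of the melt pattern, again with an inline stand-in for state_fruit2.csv:

import pandas as pd

state_fruit2 = pd.DataFrame({'State': ['Texas', 'Arizona', 'Florida'],
                             'Apple': [12, 9, 0],
                             'Orange': [10, 7, 14],
                             'Banana': [40, 12, 190]})

# id_vars stay as identifier columns; every remaining column is unpivoted
# into variable/value pairs named by var_name and value_name
tidy = state_fruit2.melt(id_vars='State',
                         var_name='Fruit',
                         value_name='Weight')
# tidy columns: State, Fruit, Weight -- 9 rows; melt discards the row index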

3. Stacking multiple groups of variables at the same time

# Read the movie dataset and select all the actor names and their Facebook likes
In[18]: movie = pd.read_csv('data/movie.csv')
        actor = movie[['movie_title', 'actor_1_name', 'actor_2_name', 'actor_3_name',
                       'actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes']]
        actor.head()
out[18]: (output shown as an image in the original post)
# Create a custom function to rename the columns. wide_to_long requires each group of columns to end with the same digit
In[19]: def change_col_name(col_name):
            col_name = col_name.replace('_name', '')
            if 'facebook' in col_name:
                fb_idx = col_name.find('facebook')
                col_name = col_name[:5] + col_name[fb_idx - 1:] + col_name[5:fb_idx-1]
            return col_name
In[20]: actor2 = actor.rename(columns=change_col_name)
        actor2.head()
out[20]: (output shown as an image in the original post)
# Use the wide_to_long function to stack the actor and Facebook columns at the same time
In[21]: stubs = ['actor', 'actor_facebook_likes']
        actor2_tidy = pd.wide_to_long(actor2,
                                      stubnames=stubs,
                                      i=['movie_title'],
                                      j='actor_num',
                                      sep='_').reset_index()
        actor2_tidy.head()
out[21]: (output shown as an image in the original post)
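A self-contained sketch of wide_to_long on made-up rows; the column names follow the stub_suffix convention that the renaming function above produces, and the names and like counts are illustrative only:

import pandas as pd

actor2 = pd.DataFrame({'movie_title': ['Avatar'],
                       'actor_1': ['CCH Pounder'],
                       'actor_2': ['Joel David Moore'],
                       'actor_facebook_likes_1': [1000.0],
                       'actor_facebook_likes_2': [936.0]})

# Each stub matches columns named <stub>_<digit>; the digit becomes the j column
tidy = pd.wide_to_long(actor2,
                       stubnames=['actor', 'actor_facebook_likes'],
                       i='movie_title',
                       j='actor_num',
                       sep='_').reset_index()
# tidy: one row per (movie_title, actor_num) with actor and likes columns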

There's more...

# Load the data
In[22]: df = pd.read_csv('data/stackme.csv')
        df
out[22]: (output shown as an image in the original post)
# Rename the columns
In[23]: df2 = df.rename(columns={'a1':'group1_a1', 'b2':'group1_b2',
                                 'd':'group2_a1', 'e':'group2_b2'})
        df2
out[23]: (output shown as an image in the original post)
# With stubnames=['group1', 'group2'] and suffix='.+', any suffix works, not just digits
In[24]: pd.wide_to_long(df2,
                        stubnames=['group1', 'group2'],
                        i=['State', 'Country', 'Test'],
                        j='Label',
                        suffix='.+',
                        sep='_')
out[24]: (output shown as an image in the original post)

4. Inverting stacked data

# Read the college dataset with the institution name as the row index, selecting only the undergraduate race columns
In[25]: usecol_func = lambda x: 'UGDS_' in x or x == 'INSTNM'
        college = pd.read_csv('data/college.csv',
                              index_col='INSTNM',
                              usecols=usecol_func)
        college.head()
out[25]: (output shown as an image in the original post)
# Use the stack method to turn all the horizontal column names into a vertical row index
In[26]: college_stacked = college.stack()
        college_stacked.head(18)
out[26]:
INSTNM
Alabama A & M University             UGDS_WHITE    0.0333
                                     UGDS_BLACK    0.9353
                                     UGDS_HISP     0.0055
                                     UGDS_ASIAN    0.0019
                                     UGDS_AIAN     0.0024
                                     UGDS_NHPI     0.0019
                                     UGDS_2MOR     0.0000
                                     UGDS_NRA      0.0059
                                     UGDS_UNKN     0.0138
University of Alabama at Birmingham  UGDS_WHITE    0.5922
                                     UGDS_BLACK    0.2600
                                     UGDS_HISP     0.0283
                                     UGDS_ASIAN    0.0518
                                     UGDS_AIAN     0.0022
                                     UGDS_NHPI     0.0007
                                     UGDS_2MOR     0.0368
                                     UGDS_NRA      0.0179
                                     UGDS_UNKN     0.0100
dtype: float64
# The unstack method reverses the operation
In[27]: college_stacked.unstack().head()
out[27]: (output shown as an image in the original post)
# Another way is to melt first, then pivot. Load the data again without setting the row index
In[28]: college2 = pd.read_csv('data/college.csv',
                               usecols=usecol_func)
        college2.head()
out[28]: (output shown as an image in the original post)
# Use melt to move all the race columns into a single column
In[29]: college_melted = college2.melt(id_vars='INSTNM',
                                       var_name='Race',
                                       value_name='Percentage')
        college_melted.head()
out[29]: (output shown as an image in the original post)
# Use pivot to invert the melt
In[30]: melted_inv = college_melted.pivot(index='INSTNM',
                                          columns='Race',
                                          values='Percentage')
        melted_inv.head()
out[30]: (output shown as an image in the original post)
# Select the rows and columns with loc, then reset the index, to recover a DataFrame in the original order
In[31]: college2_replication = melted_inv.loc[college2['INSTNM'],
                                              college2.columns[1:]]\
                                         .reset_index()
        college2.equals(college2_replication)
out[31]: True
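The same round trip in miniature, on a two-row stand-in rather than the college file:

import pandas as pd

df = pd.DataFrame({'INSTNM': ['College A', 'College B'],
                   'UGDS_WHITE': [0.1, 0.2],
                   'UGDS_BLACK': [0.9, 0.8]})

melted = df.melt(id_vars='INSTNM', var_name='Race', value_name='Percentage')
restored = (melted.pivot(index='INSTNM', columns='Race', values='Percentage')
                  .loc[df['INSTNM'], df.columns[1:]]   # restore original order
                  .reset_index())
assert df.equals(restored)   # pivot exactly inverts the melt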

There's more...

# unstack using the outermost row index level
In[32]: college.stack().unstack(0)
out[32]: (output shown as an image in the original post)
# A simpler way to transpose a DataFrame is transpose() or T
In[33]: college.T
out[33]: (output shown as an image in the original post)

5. Unstacking after a groupby aggregation

# Read the employee dataset and find the mean salary for each race
In[34]: employee = pd.read_csv('data/employee.csv')
In[35]: employee.groupby('RACE')['BASE_SALARY'].mean().astype(int)
out[35]:
RACE
American Indian or Alaskan Native    60272
Asian/Pacific Islander               61660
Black or African American            50137
Hispanic/Latino                      52345
Others                               51278
White                                64419
Name: BASE_SALARY, dtype: int64
# Group by race and gender and find the mean salary
In[36]: agg = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY'].mean().astype(int)
        agg
out[36]:
RACE                               GENDER
American Indian or Alaskan Native  Female    60238
                                   Male      60305
Asian/Pacific Islander             Female    63226
                                   Male      61033
Black or African American          Female    48915
                                   Male      51082
Hispanic/Latino                    Female    46503
                                   Male      54782
Others                             Female    63785
                                   Male      38771
White                              Female    66793
                                   Male      63940
Name: BASE_SALARY, dtype: int64
# unstack the GENDER index level
In[37]: agg.unstack('GENDER')
out[37]: (output shown as an image in the original post)
# unstack the RACE index level
In[38]: agg.unstack('RACE')
out[38]: (output shown as an image in the original post)
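A compact sketch of the groupby-then-unstack pattern on made-up numbers:

import pandas as pd

emp = pd.DataFrame({'RACE': ['White', 'White', 'Asian', 'Asian'],
                    'GENDER': ['Female', 'Male', 'Female', 'Male'],
                    'BASE_SALARY': [66793, 63940, 63226, 61033]})

# The grouped result is a Series with a two-level index;
# unstack moves the named GENDER level into the columns
agg = emp.groupby(['RACE', 'GENDER'])['BASE_SALARY'].mean()
wide = agg.unstack('GENDER')   # rows: RACE, columns: Female / Male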

There's more...

# Group by RACE and GENDER and find the mean, max, and min salary
In[39]: agg2 = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY'] \
                       .agg(['mean', 'max', 'min']).astype(int)
        agg2
out[39]: (output shown as an image in the original post)
# unstack('GENDER') now produces a multi-level column index; stack and unstack can be used to rearrange its structure
agg2.unstack('GENDER')
(output shown as an image in the original post)

6. Replicating pivot_table with a groupby aggregation

# Read the flights dataset
In[40]: flights = pd.read_csv('data/flights.csv')
        flights.head()
out[40]: (output shown as an image in the original post)
# Use pivot_table to find the total number of cancelled flights per airline per origin airport
In[41]: fp = flights.pivot_table(index='AIRLINE',
                                 columns='ORG_AIR',
                                 values='CANCELLED',
                                 aggfunc='sum',
                                 fill_value=0).round(2)
        fp.head()
out[41]: (output shown as an image in the original post)
# A groupby aggregation cannot replicate this table directly. First group by all the columns that appear in index and columns
In[42]: fg = flights.groupby(['AIRLINE', 'ORG_AIR'])['CANCELLED'].sum()
        fg.head()
out[42]:
AIRLINE  ORG_AIR
AA       ATL         3
         DEN         4
         DFW        86
         IAH         3
         LAS         3
Name: CANCELLED, dtype: int64
# Then use unstack to move the ORG_AIR index level into the column names
In[43]: fg_unstack = fg.unstack('ORG_AIR', fill_value=0)
        fg_unstack.head()
out[43]: (output shown as an image in the original post)
# Check whether the two approaches are equivalent
In[44]: fg_unstack = fg.unstack('ORG_AIR', fill_value=0)
        fp.equals(fg_unstack)
out[44]: True
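The equivalence in miniature, on made-up flights but with the same construction as above:

import pandas as pd

fl = pd.DataFrame({'AIRLINE': ['AA', 'AA', 'UA', 'UA'],
                   'ORG_AIR': ['ATL', 'DFW', 'ATL', 'ATL'],
                   'CANCELLED': [1, 0, 0, 1]})

via_pivot = fl.pivot_table(index='AIRLINE', columns='ORG_AIR',
                           values='CANCELLED', aggfunc='sum', fill_value=0)
via_groupby = fl.groupby(['AIRLINE', 'ORG_AIR'])['CANCELLED'].sum() \
                .unstack('ORG_AIR', fill_value=0)
assert via_pivot.equals(via_groupby)   # same table by either route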

There's more...

# First build a slightly more complex pivot table
In[45]: fp2 = flights.pivot_table(index=['AIRLINE', 'MONTH'],
                                  columns=['ORG_AIR', 'CANCELLED'],
                                  values=['DEP_DELAY', 'DIST'],
                                  aggfunc=[np.mean, np.sum],
                                  fill_value=0)
        fp2.head()
out[45]: (output shown as an image in the original post)
# Replicate the table above with groupby and unstack
In[46]: flights.groupby(['AIRLINE', 'MONTH', 'ORG_AIR', 'CANCELLED'])['DEP_DELAY', 'DIST'] \
               .agg(['mean', 'sum']) \
               .unstack(['ORG_AIR', 'CANCELLED'], fill_value=0) \
               .swaplevel(0, 1, axis='columns') \
               .head()
out[46]: (output shown as an image in the original post)

7. Renaming index levels for easier reshaping

# Read the college dataset, group, and compute summary statistics of the undergraduate population and SAT math scores
In[47]: college = pd.read_csv('data/college.csv')
In[48]: cg = college.groupby(['STABBR', 'RELAFFIL'])['UGDS', 'SATMTMID'] \
                    .agg(['count', 'min', 'max']).head(6)
In[49]: cg
out[49]: (output shown as an image in the original post)
# Both row index levels have names, but the column levels do not. Use rename_axis to name them
In[50]: cg = cg.rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
        cg
out[50]: (output shown as an image in the original post)
# Move the AGG_FUNCS column level into the row index
In[51]: cg.stack('AGG_FUNCS').head()
out[51]: (output shown as an image in the original post)
# By default, stack places the column level in the innermost position of the row index; use swaplevel to change that
In[52]: cg.stack('AGG_FUNCS').swaplevel('AGG_FUNCS', 'STABBR', axis='index').head()
out[52]: (output shown as an image in the original post)
# Then sort the result with sort_index
In[53]: cg.stack('AGG_FUNCS') \
          .swaplevel('AGG_FUNCS', 'STABBR', axis='index') \
          .sort_index(level='RELAFFIL', axis='index') \
          .sort_index(level='AGG_COLS', axis='columns').head(6)
out[53]: (output shown as an image in the original post)
# stack some levels and unstack others in a single statement
In[54]: cg.stack('AGG_FUNCS').unstack(['RELAFFIL', 'STABBR'])
out[54]: (output shown as an image in the original post)
# Stacking every column level returns a Series
In[55]: cg.stack(['AGG_FUNCS', 'AGG_COLS']).head(12)
out[55]: (output shown as an image in the original post)
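A sketch of why the names matter: once the levels are named, stack, swaplevel, and sort_index can all refer to them by name instead of position. The small MultiIndex frame below is a stand-in for cg:

import pandas as pd

cols = pd.MultiIndex.from_product([['UGDS', 'SATMTMID'], ['min', 'max']],
                                  names=['AGG_COLS', 'AGG_FUNCS'])
idx = pd.MultiIndex.from_product([['AK', 'AL'], [0, 1]],
                                 names=['STABBR', 'RELAFFIL'])
cg = pd.DataFrame([[1, 10, 400, 500]] * 4, index=idx, columns=cols)

out = (cg.stack('AGG_FUNCS')                              # pick a level by name
         .swaplevel('AGG_FUNCS', 'STABBR', axis='index')  # reorder by name
         .sort_index(level='RELAFFIL', axis='index'))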

There's more...

# Remove the level names from both the row and column indexes
In[56]: cg.rename_axis([None, None], axis='index').rename_axis([None, None], axis='columns')
out[56]: (output shown as an image in the original post)

8. Tidying when multiple variables are stored as column names

# Read the weightlifting dataset
In[57]: weightlifting = pd.read_csv('data/weightlifting_men.csv')
        weightlifting
out[57]: (output shown as an image in the original post)
# Use melt to move the sex_age column names into a single column
In[58]: wl_melt = weightlifting.melt(id_vars='Weight Category',
                                     var_name='sex_age',
                                     value_name='Qual Total')
        wl_melt.head()
out[58]: (output shown as an image in the original post)
# Use the split method to break sex_age into two columns
In[59]: sex_age = wl_melt['sex_age'].str.split(expand=True)
        sex_age.head()
out[59]:
     0      1
0  M35  35-39
1  M35  35-39
2  M35  35-39
3  M35  35-39
4  M35  35-39
# Name the columns
In[60]: sex_age.columns = ['Sex', 'Age Group']
        sex_age.head()
out[60]: (output shown as an image in the original post)
# Keep only the leading M from the string
In[61]: sex_age['Sex'] = sex_age['Sex'].str[0]
        sex_age.head()
out[61]: (output shown as an image in the original post)
# Use concat to join sex_age with wl_cat_total
In[62]: wl_cat_total = wl_melt[['Weight Category', 'Qual Total']]
        wl_tidy = pd.concat([sex_age, wl_cat_total], axis='columns')
        wl_tidy.head()
out[62]: (output shown as an image in the original post)
# The same result can also be achieved like this
In[63]: cols = ['Weight Category', 'Qual Total']
        sex_age[cols] = wl_melt[cols]
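The whole split-and-rejoin pattern, self-contained on two made-up rows:

import pandas as pd

wl_melt = pd.DataFrame({'Weight Category': ['56', '62'],
                        'sex_age': ['M35 35-39', 'M40 40-44'],
                        'Qual Total': [137, 152]})

sex_age = wl_melt['sex_age'].str.split(expand=True)   # split on whitespace
sex_age.columns = ['Sex', 'Age Group']
sex_age['Sex'] = sex_age['Sex'].str[0]                # 'M35' -> 'M'
wl_tidy = pd.concat([sex_age, wl_melt[['Weight Category', 'Qual Total']]],
                    axis='columns')
# wl_tidy columns: Sex, Age Group, Weight Category, Qual Total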

There's more...

# The new columns can also be created dynamically with assign
In[64]: age_group = wl_melt.sex_age.str.extract('(\d{2}[-+](?:\d{2})?)', expand=False)
        sex = wl_melt.sex_age.str[0]
        new_cols = {'Sex': sex,
                    'Age Group': age_group}
In[65]: wl_tidy2 = wl_melt.assign(**new_cols).drop('sex_age', axis='columns')
        wl_tidy2.head()
out[65]: (output shown as an image in the original post)
# Check whether the two approaches are equivalent
In[66]: wl_tidy2.sort_index(axis=1).equals(wl_tidy.sort_index(axis=1))
out[66]: True

9. Tidying when multiple variables are stored as column values

# Read the restaurant_inspections dataset, parsing the Date column as datetime64
In[67]: inspections = pd.read_csv('data/restaurant_inspections.csv', parse_dates=['Date'])
        inspections.head(10)
out[67]: (output shown as an image in the original post)
# Try to make a new column from every value in the Info column. pivot did not support a list of index columns in this version of Pandas, so an error is raised
In[68]: inspections.pivot(index=['Name', 'Date'], columns='Info', values='Value')
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
...
NotImplementedError: > 1 ndim Categorical are not supported at this time
# Put 'Name', 'Date', and 'Info' into the row index
In[69]: inspections.set_index(['Name','Date', 'Info']).head(10)
out[69]: (output shown as an image in the original post)
# Use unstack to turn the values of the Info level into new columns
In[70]: inspections.set_index(['Name','Date', 'Info']).unstack('Info').head()
out[70]: (output shown as an image in the original post)
# Use reset_index to move the row index levels back into columns
In[71]: insp_tidy = inspections.set_index(['Name','Date', 'Info']) \
                               .unstack('Info') \
                               .reset_index(col_level=-1)
        insp_tidy.head()
out[71]: (output shown as an image in the original post)
# Drop the outermost column level and remove the name of the remaining level
In[72]: insp_tidy.columns = insp_tidy.columns.droplevel(0).rename(None)
        insp_tidy.head()
out[72]: (output shown as an image in the original post)
# Using the squeeze method avoids the multi-level columns altogether
In[73]: inspections.set_index(['Name','Date', 'Info']) \
                   .squeeze() \
                   .unstack('Info') \
                   .reset_index() \
                   .rename_axis(None, axis='columns')
out[73]: (output shown as an image in the original post)
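The squeeze-based variant on inline data (names and values below are illustrative, not from the inspections file):

import pandas as pd

insp = pd.DataFrame({'Name': ['A', 'A', 'B', 'B'],
                     'Date': ['2017-01-01'] * 4,
                     'Info': ['Score', 'Type', 'Score', 'Type'],
                     'Value': ['90', 'Cafe', '85', 'Bar']})

tidy = (insp.set_index(['Name', 'Date', 'Info'])
            .squeeze()                    # one-column DataFrame -> Series
            .unstack('Info')              # Info values become columns
            .reset_index()
            .rename_axis(None, axis='columns'))
# tidy columns: Name, Date, Score, Type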

There's more...

# pivot_table needs an aggregation function to produce a single value per cell
In[74]: inspections.pivot_table(index=['Name', 'Date'],
                                columns='Info',
                                values='Value',
                                aggfunc='first') \
                   .reset_index() \
                   .rename_axis(None, axis='columns')
out[74]: (output shown as an image in the original post)

10. Tidying when two or more values are stored in the same cell

# Read the texas_cities dataset
In[75]: cities = pd.read_csv('data/texas_cities.csv')
        cities
out[75]: (output shown as an image in the original post)
# Split Geolocation into four separate columns
In[76]: geolocations = cities.Geolocation.str.split(pat='. ', expand=True)
        geolocations.columns = ['latitude', 'latitude direction',
                                'longitude', 'longitude direction']
        geolocations
out[76]: (output shown as an image in the original post)
# Convert the data types
In[77]: geolocations = geolocations.astype({'latitude':'float', 'longitude':'float'})
        geolocations.dtypes
out[77]:
latitude               float64
latitude direction      object
longitude              float64
longitude direction     object
dtype: object
# Join the new columns with the original City column
In[78]: cities_tidy = pd.concat([cities['City'], geolocations], axis='columns')
        cities_tidy
out[78]: (output shown as an image in the original post)

How it works...

# The to_numeric function converts each column to integer or float automatically
In[80]: temp = geolocations.apply(pd.to_numeric, errors='ignore')
        temp
out[80]: (output shown as an image in the original post)
# Check the data types again
In[81]: temp.dtypes
out[81]:
latitude               float64
latitude direction      object
longitude              float64
longitude direction     object
dtype: object

There's more...

# The | symbol allows splitting on several delimiters at once
In[82]: cities.Geolocation.str.split(pat='° |, ', expand=True)
out[82]: (output shown as an image in the original post)
# A more sophisticated extraction
In[83]: cities.Geolocation.str.extract('([0-9.]+). (N|S), ([0-9.]+). (E|W)', expand=True)
out[83]: (output shown as an image in the original post)
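str.extract can also name the new columns directly through named capture groups; a sketch, assuming the degree-sign separator used in the split above:

import pandas as pd

geo = pd.Series(['29.7604° N, 95.3698° W'])
parts = geo.str.extract(r'(?P<latitude>[0-9.]+)° (?P<lat_dir>N|S), '
                        r'(?P<longitude>[0-9.]+)° (?P<lon_dir>E|W)')
# parts columns come straight from the group names:
# latitude, lat_dir, longitude, lon_dir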

11. Tidying when variables are stored in both column names and column values

# Read the sensors dataset
In[84]: sensors = pd.read_csv('data/sensors.csv')
        sensors
out[84]: (output shown as an image in the original post)
# Start tidying the data with melt
In[85]: sensors.melt(id_vars=['Group', 'Property'], var_name='Year').head(6)
out[85]: (output shown as an image in the original post)
# Use pivot_table to turn the values of the Property column into new column names
In[86]: sensors.melt(id_vars=['Group', 'Property'], var_name='Year') \
               .pivot_table(index=['Group', 'Year'], columns='Property', values='value') \
               .reset_index() \
               .rename_axis(None, axis='columns')
out[86]: (output shown as an image in the original post)

There's more...

# The same result with stack and unstack
In[87]: sensors.set_index(['Group', 'Property']) \
               .stack() \
               .unstack('Property') \
               .rename_axis(['Group', 'Year'], axis='index') \
               .rename_axis(None, axis='columns') \
               .reset_index()
out[87]: (output shown as an image in the original post)
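Both routes in miniature, on an inline stand-in for sensors.csv:

import pandas as pd

sensors = pd.DataFrame({'Group': ['A', 'A'],
                        'Property': ['Pressure', 'Temperature'],
                        '2012': [928, 1026],
                        '2013': [873, 1038]})

tidy = (sensors.melt(id_vars=['Group', 'Property'], var_name='Year')
               .pivot_table(index=['Group', 'Year'],
                            columns='Property', values='value')
               .reset_index()
               .rename_axis(None, axis='columns'))
# tidy columns: Group, Year, Pressure, Temperature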

12. Tidying when multiple observational units are stored in the same table

# Read the movie_altered dataset
In[88]: movie = pd.read_csv('data/movie_altered.csv')
        movie.head()
out[88]: (output shown as an image in the original post)
# Insert a new column that identifies each movie
In[89]: movie.insert(0, 'id', np.arange(len(movie)))
        movie.head()
out[89]: (output shown as an image in the original post)
# Use wide_to_long to put all the actors in one column and all the Facebook likes in another
In[90]: stubnames = ['director', 'director_fb_likes', 'actor', 'actor_fb_likes']
        movie_long = pd.wide_to_long(movie,
                                     stubnames=stubnames,
                                     i='id',
                                     j='num',
                                     sep='_').reset_index()
        movie_long['num'] = movie_long['num'].astype(int)
        movie_long.head(9)
out[90]: (output shown as an image in the original post)
# Break this data into several smaller tables
In[91]: movie_table = movie_long[['id', 'title', 'year', 'duration', 'rating']]
        director_table = movie_long[['id', 'director', 'num', 'director_fb_likes']]
        actor_table = movie_long[['id', 'actor', 'num', 'actor_fb_likes']]
In[92]: movie_table.head(9)
out[92]: (output shown as an image in the original post)
In[93]: director_table.head(9)
out[93]: (output shown as an image in the original post)
In[94]: actor_table.head(9)
out[94]: (output shown as an image in the original post)
# Deduplicate and drop missing values
In[95]: movie_table = movie_table.drop_duplicates().reset_index(drop=True)
        director_table = director_table.dropna().reset_index(drop=True)
        actor_table = actor_table.dropna().reset_index(drop=True)
In[96]: movie_table.head()
out[96]: (output shown as an image in the original post)
In[97]: director_table.head()
out[97]: (output shown as an image in the original post)
# Compare memory usage
In[98]: movie.memory_usage(deep=True).sum()
out[98]: 2318234
In[99]: movie_table.memory_usage(deep=True).sum() + \
        director_table.memory_usage(deep=True).sum() + \
        actor_table.memory_usage(deep=True).sum()
out[99]: 2624898
# Create id columns for the actors and directors
In[100]: director_cat = pd.Categorical(director_table['director'])
         director_table.insert(1, 'director_id', director_cat.codes)
         actor_cat = pd.Categorical(actor_table['actor'])
         actor_table.insert(1, 'actor_id', actor_cat.codes)
         director_table.head()
out[100]: (output shown as an image in the original post)
In[101]: actor_table.head()
out[101]: (output shown as an image in the original post)
# These two tables can be used to build the associative tables we need. Start with the director table
In[102]: director_associative = director_table[['id', 'director_id', 'num']]
         dcols = ['director_id', 'director', 'director_fb_likes']
         director_unique = director_table[dcols].drop_duplicates().reset_index(drop=True)
         director_associative.head()
out[102]: (output shown as an image in the original post)
In[103]: director_unique.head()
out[103]: (output shown as an image in the original post)
# Now do the same for the actor table
In[104]: actor_associative = actor_table[['id', 'actor_id', 'num']]
         acols = ['actor_id', 'actor', 'actor_fb_likes']
         actor_unique = actor_table[acols].drop_duplicates().reset_index(drop=True)
         actor_associative.head()
out[104]: (output shown as an image in the original post)
In[105]: actor_unique.head()
out[105]: (output shown as an image in the original post)
# Check how much memory the new tables use
In[106]: movie_table.memory_usage(deep=True).sum() + \
         director_associative.memory_usage(deep=True).sum() + \
         director_unique.memory_usage(deep=True).sum() + \
         actor_associative.memory_usage(deep=True).sum() + \
         actor_unique.memory_usage(deep=True).sum()
out[106]: 1833402
In[107]: movie_table.head()
out[107]: (output shown as an image in the original post)
# The movie table can be rebuilt by joining the tables back together. First merge the associative tables with the actor/director tables, then pivot the num column and prefix the column names
In[108]: actors = actor_associative.merge(actor_unique, on='actor_id') \
                                   .drop('actor_id', 1) \
                                   .pivot_table(index='id', columns='num', aggfunc='first')
         actors.columns = actors.columns.get_level_values(0) + '_' + \
                          actors.columns.get_level_values(1).astype(str)
         directors = director_associative.merge(director_unique, on='director_id') \
                                         .drop('director_id', 1) \
                                         .pivot_table(index='id', columns='num', aggfunc='first')
         directors.columns = directors.columns.get_level_values(0) + '_' + \
                             directors.columns.get_level_values(1).astype(str)
In[109]: actors.head()
out[109]: (output shown as an image in the original post)
In[110]: directors.head()
out[110]: (output shown as an image in the original post)
In[111]: movie2 = movie_table.merge(directors.reset_index(), on='id', how='left') \
                             .merge(actors.reset_index(), on='id', how='left')
In[112]: movie2.head()
out[112]: (output shown as an image in the original post)
In[113]: movie.equals(movie2[movie.columns])
out[113]: True
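The surrogate-key trick from In[100] in isolation: pd.Categorical assigns every unique value an integer code, which serves as the id. A minimal sketch on made-up names:

import pandas as pd

names = pd.Series(['James Cameron', 'Gore Verbinski', 'James Cameron'])
cat = pd.Categorical(names)
# Categories are sorted, so the codes are stable integer ids:
# 'Gore Verbinski' -> 0, 'James Cameron' -> 1
ids = cat.codes   # array([1, 0, 1], dtype=int8)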


