- November 25, 2020
- saptrxuy_learnit
- Python
Machine Learning with Diamond Data – Part 1
1. Introduction
2. Load data
3. Print list of members and methods of the loaded data object
4. Print details of the loaded dataset
5. Find the unique values of some columns
6. Convert the nominal values of cut into numeric dummy variables
7. Convert the nominal values of color and clarity into dummy variables as well
8. Set price as the target and create a new dataset X by dropping this column from the original dataset
9. Use RobustScaler to transform X
10. Create a dataset y with only the target column
11. Create train and test data
12. Create a dataframe to store the results of different models
13. Create a KNN model
14. Fit the model to the training data
15. Record the model's results in the results matrix
16. Implement a Bagging model
17. RandomForest model
18. Boosting model
19. Save models to a file
1. Introduction
Download the data from https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv and save it locally; the next step reads it from F:/data/input/diamonds.csv.
1.1. Code
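If the file is not yet on disk, a minimal sketch using only the standard library can fetch it; the URL is the one above and the local path matches data_path in step 2 (both are assumptions to adapt for your machine, and the folder must exist first):

import urllib.request

# One-time download (sketch): URL from the introduction above; the local
# target path matches data_path used in step 2.
url = 'https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv'
urllib.request.urlretrieve(url, 'F:/data/input/diamonds.csv')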
2. Load data
2.1. Code
# Download data from https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv
import pandas as pd

data_path = 'F:/data/input/diamonds.csv'
diamonds = pd.read_csv(data_path)

# This print stays at the very end of the script as more code is added below.
print("\n", 50 * "-", "\nProgram Over")
3. Print list of members and methods of the loaded data object
3.1. Code
print(dir(diamonds.__class__))
3.2. Output
['T', '_AXIS_ALIASES', '_AXIS_IALIASES', '_AXIS_LEN', '_AXIS_NAMES', '_AXIS_NUMBERS', '_AXIS_ORDERS', '_AXIS_REVERSED', '__abs__', '__add__', '__and__', '__array__', '__array_priority__', '__array_wrap__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__div__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_accessors', '_add_numeric_operations', '_add_series_only_operations', '_add_series_or_dataframe_operations', '_agg_by_level', '_agg_examples_doc', '_agg_summary_and_see_also_doc', '_aggregate', '_aggregate_multiple_funcs', '_align_frame', '_align_series', '_box_col_values', '_box_item_values', '_builtin_table', '_check_inplace_setting', '_check_is_chained_assignment_possible', '_check_label_or_level_ambiguity', '_check_percentile', '_check_setitem_copy', '_clear_item_cache', '_clip_with_one_bound', '_clip_with_scalar', '_combine_const', '_combine_frame', '_combine_match_columns', '_combine_match_index', '_consolidate', '_consolidate_inplace', '_construct_axes_dict', '_construct_axes_dict_from', '_construct_axes_from_arguments', '_constructor', '_constructor_expanddim', '_constructor_sliced', '_convert', '_count_level', '_create_indexer', '_cython_table', '_data', '_deprecations', '_dir_additions', '_dir_deletions', '_drop_axis', '_drop_labels_or_levels', '_ensure_valid_index', '_find_valid_index', '_from_arrays', '_from_axes', '_get_agg_axis', '_get_axis', '_get_axis_name', '_get_axis_number', '_get_axis_resolvers', '_get_block_manager_axis', '_get_bool_data', '_get_cacher', '_get_index_resolvers', '_get_item_cache', '_get_label_or_level_values', '_get_numeric_data', '_get_space_character_free_column_resolvers', '_get_value', '_get_values', '_getitem_bool_array', '_getitem_frame', '_getitem_multilevel', '_gotitem', '_iget_item_cache', '_indexed_same', '_info_axis', '_info_axis_name', '_info_axis_number', '_info_repr', '_init_mgr', '_internal_get_values', '_internal_names', '_internal_names_set', '_is_builtin_func', '_is_cached', '_is_copy', '_is_cython_func', '_is_datelike_mixed_type', '_is_homogeneous_type', '_is_label_or_level_reference', '_is_label_reference', '_is_level_reference', '_is_mixed_type', '_is_numeric_mixed_type', '_is_view', '_ix', '_ixs', '_join_compat', '_maybe_cache_changed', '_maybe_update_cacher', '_metadata', '_needs_reindex_multi', '_obj_with_exclusions', '_protect_consolidate', '_reduce', '_reindex_axes', '_reindex_columns', '_reindex_index', '_reindex_multi', '_reindex_with_indexers', '_repr_data_resource_', '_repr_fits_horizontal_', '_repr_fits_vertical_', '_repr_html_', '_repr_latex_', '_reset_cache', '_reset_cacher', '_sanitize_column', '_selected_obj', '_selection', 
'_selection_list', '_selection_name', '_series', '_set_as_cached', '_set_axis', '_set_axis_name', '_set_is_copy', '_set_item', '_set_value', '_setitem_array', '_setitem_frame', '_setitem_slice', '_setup_axes', '_shallow_copy', '_slice', '_stat_axis', '_stat_axis_name', '_stat_axis_number', '_to_dict_of_blocks', '_try_aggregate_string_function', '_typ', '_unpickle_frame_compat', '_unpickle_matrix_compat', '_update_inplace', '_validate_dtype', '_values', '_where', '_xs', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'append', 'apply', 'applymap', 'as_blocks', 'as_matrix', 'asfreq', 'asof', 'assign', 'astype', 'at', 'at_time', 'axes', 'between_time', 'bfill', 'blocks', 'bool', 'boxplot', 'clip', 'clip_lower', 'clip_upper', 'columns', 'combine', 'combine_first', 'compound', 'copy', 'corr', 'corrwith', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'eval', 'ewm', 'expanding', 'explode', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'floordiv', 'from_dict', 'from_items', 'from_records', 'ftypes', 'ge', 'get', 'get_dtype_counts', 'get_ftype_counts', 'get_value', 'get_values', 'groupby', 'gt', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'info', 'insert', 'interpolate', 'is_copy', 'isin', 'isna', 'isnull', 'items', 'iteritems', 'iterrows', 'itertuples', 'ix', 'join', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lookup', 'lt', 'mad', 'mask', 'max', 'mean', 'median', 'melt', 'memory_usage', 'merge', 'min', 'mod', 'mode', 'mul', 'multiply', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pct_change', 'pipe', 'pivot', 'pivot_table', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'query', 'radd', 'rank', 'rdiv', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'select_dtypes', 'sem', 'set_axis', 'set_index', 'set_value', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort_index', 'sort_values', 'sparse', 'squeeze', 'stack', 'std', 'style', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dense', 'to_dict', 'to_excel', 'to_feather', 'to_gbq', 'to_hdf', 'to_html', 'to_json', 'to_latex', 'to_msgpack', 'to_numpy', 'to_parquet', 'to_period', 'to_pickle', 'to_records', 'to_sparse', 'to_sql', 'to_stata', 'to_string', 'to_timestamp', 'to_xarray', 'transform', 'transpose', 'truediv', 'truncate', 'tshift', 'tz_convert', 'tz_localize', 'unstack', 'update', 'values', 'var', 'where', 'xs']
4. Print details of the loaded dataset
4.1. Code
#print(dir(diamonds.__class__))

print("\n", "diamonds.head(10)", "\n", diamonds.head(10))
print("\n", "diamonds.columns", "\n", diamonds.columns)

print("\n", "diamonds.shape", "\n", diamonds.shape)
4.2. Output
diamonds.head(10)
   carat        cut color clarity  depth  table  price     x     y     z
0   0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21    Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23       Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29    Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
5   0.24  Very Good     J    VVS2   62.8   57.0    336  3.94  3.96  2.48
6   0.24  Very Good     I    VVS1   62.3   57.0    336  3.95  3.98  2.47
7   0.26  Very Good     H     SI1   61.9   55.0    337  4.07  4.11  2.53
8   0.22       Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49
9   0.23  Very Good     H     VS1   59.4   61.0    338  4.00  4.05  2.39

diamonds.columns
Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x',
       'y', 'z'],
      dtype='object')

diamonds.shape
(53940, 10)
5. Find the unique values of some columns
5.1. Code
print("\n", "diamonds['cut'].unique()", "\n", diamonds['cut'].unique())
print("\n", "diamonds['color'].unique()", "\n", diamonds['color'].unique())
print("\n", "diamonds['clarity'].unique()", "\n", diamonds['clarity'].unique())
5.2. Output
diamonds['cut'].unique()
['Ideal' 'Premium' 'Good' 'Very Good' 'Fair']

diamonds['color'].unique()
['E' 'I' 'J' 'H' 'F' 'G' 'D']

diamonds['clarity'].unique()
['SI2' 'SI1' 'VS1' 'VS2' 'VVS2' 'VVS1' 'I1' 'IF']
6. Convert the nominal values of cut into numeric dummy variables
6.1. Code
dummysCut = pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)
print("\n", "dummysCut = pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)\n", "dummysCut.head(10)", "\n", dummysCut.head(10))

diamonds = pd.concat([diamonds, dummysCut], axis=1)
print("\n", "After concatenating dummysCut with diamonds \ndiamonds.head(10)", "\n", diamonds.head(10))
print("\n", "diamonds.columns", "\n", diamonds.columns)
print("\n", "diamonds.head(10)", "\n", diamonds.head(10))
6.2. Output
dummysCut = pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)
dummysCut.head(10)
   cut_Good  cut_Ideal  cut_Premium  cut_Very Good
0         0          1            0              0
1         0          0            1              0
2         1          0            0              0
3         0          0            1              0
4         1          0            0              0
5         0          0            0              1
6         0          0            0              1
7         0          0            0              1
8         0          0            0              0
9         0          0            0              1

After concatenating dummysCut with diamonds
diamonds.head(10)
   carat        cut color  ...  cut_Ideal  cut_Premium  cut_Very Good
0   0.23      Ideal     E  ...          1            0              0
1   0.21    Premium     E  ...          0            1              0
2   0.23       Good     E  ...          0            0              0
3   0.29    Premium     I  ...          0            1              0
4   0.31       Good     J  ...          0            0              0
5   0.24  Very Good     J  ...          0            0              1
6   0.24  Very Good     I  ...          0            0              1
7   0.26  Very Good     H  ...          0            0              1
8   0.22       Fair     E  ...          0            0              0
9   0.23  Very Good     H  ...          0            0              1

[10 rows x 14 columns]

diamonds.columns
Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x',
       'y', 'z', 'cut_Good', 'cut_Ideal', 'cut_Premium', 'cut_Very Good'],
      dtype='object')

diamonds.head(10)
   carat        cut color  ...  cut_Ideal  cut_Premium  cut_Very Good
0   0.23      Ideal     E  ...          1            0              0
1   0.21    Premium     E  ...          0            1              0
2   0.23       Good     E  ...          0            0              0
3   0.29    Premium     I  ...          0            1              0
4   0.31       Good     J  ...          0            0              0
5   0.24  Very Good     J  ...          0            0              1
6   0.24  Very Good     I  ...          0            0              1
7   0.26  Very Good     H  ...          0            0              1
8   0.22       Fair     E  ...          0            0              0
9   0.23  Very Good     H  ...          0            0              1

[10 rows x 14 columns]
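Strictly speaking, pd.get_dummies performs one-hot (dummy) encoding rather than ordinal encoding: each category becomes its own 0/1 column. If a true ordinal encoding were wanted (cut quality has a natural order), a simple mapping would do. The ranking and the cut_ordinal column name below are assumptions based on the standard diamond cut scale, not part of the original script:

# Hypothetical alternative: map cut to ordered integers instead of dummy columns.
cut_order = {'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}
diamonds['cut_ordinal'] = diamonds['cut'].map(cut_order)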
7. Convert the nominal values of color and clarity into dummy variables as well
7.1. Code
# Note: if the concat from step 6 is still in the script, the next line adds the
# cut dummies a second time; the duplicated cut_* columns show up in X.columns in step 8.
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)], axis=1)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['color'], prefix='color', drop_first=True)], axis=1)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['clarity'], prefix='clarity', drop_first=True)], axis=1)
diamonds.drop(['cut', 'color', 'clarity'], axis=1, inplace=True)

print("\n", "diamonds.head(10)", "\n", diamonds.head(10))
print("\n", "diamonds.columns", "\n", diamonds.columns)
7.2. Output
diamonds.head(10)
   carat  depth  table  ...  clarity_VS2  clarity_VVS1  clarity_VVS2
0   0.23   61.5   55.0  ...            0             0             0
1   0.21   59.8   61.0  ...            0             0             0
2   0.23   56.9   65.0  ...            0             0             0
3   0.29   62.4   58.0  ...            1             0             0
4   0.31   63.3   58.0  ...            0             0             0
5   0.24   62.8   57.0  ...            0             0             1
6   0.24   62.3   57.0  ...            0             1             0
7   0.26   61.9   55.0  ...            0             0             0
8   0.22   65.1   61.0  ...            1             0             0
9   0.23   59.4   61.0  ...            0             0             0

[10 rows x 24 columns]

diamonds.columns
Index(['carat', 'depth', 'table', 'price', 'x', 'y', 'z', 'cut_Good',
       'cut_Ideal', 'cut_Premium', 'cut_Very Good', 'color_E', 'color_F',
       'color_G', 'color_H', 'color_I', 'color_J', 'clarity_IF',
       'clarity_SI1', 'clarity_SI2', 'clarity_VS1', 'clarity_VS2',
       'clarity_VVS1', 'clarity_VVS2'],
      dtype='object')
8. Set price as the target and create a new dataset X by dropping this column from the original dataset
8.1. Code
target_name = 'price'
X = diamonds.drop('price', axis=1)

print("\n", "X.columns", "\n", X.columns)
print("\n", "X.head(5)", "\n", X.head(5))
print(X.describe())
8.2. Output
X.columns
Index(['carat', 'depth', 'table', 'x', 'y', 'z', 'cut_Good', 'cut_Ideal',
       'cut_Premium', 'cut_Very Good', 'cut_Good', 'cut_Ideal', 'cut_Premium',
       'cut_Very Good', 'color_E', 'color_F', 'color_G', 'color_H', 'color_I',
       'color_J', 'clarity_IF', 'clarity_SI1', 'clarity_SI2', 'clarity_VS1',
       'clarity_VS2', 'clarity_VVS1', 'clarity_VVS2'],
      dtype='object')

X.head(5)
   carat  depth  table  ...  clarity_VS2  clarity_VVS1  clarity_VVS2
0   0.23   61.5   55.0  ...            0             0             0
1   0.21   59.8   61.0  ...            0             0             0
2   0.23   56.9   65.0  ...            0             0             0
3   0.29   62.4   58.0  ...            1             0             0
4   0.31   63.3   58.0  ...            0             0             0

[5 rows x 27 columns]

              carat         depth  ...  clarity_VVS1  clarity_VVS2
count  53940.000000  53940.000000  ...  53940.000000  53940.000000
mean       0.797940     61.749405  ...      0.067760      0.093919
std        0.474011      1.432621  ...      0.251337      0.291719
min        0.200000     43.000000  ...      0.000000      0.000000
25%        0.400000     61.000000  ...      0.000000      0.000000
50%        0.700000     61.800000  ...      0.000000      0.000000
75%        1.040000     62.500000  ...      0.000000      0.000000
max        5.010000     79.000000  ...      1.000000      1.000000

[8 rows x 27 columns]
9. Use RobustScaler to transform X
9.1. Code
from sklearn.preprocessing import RobustScaler  # goes at the top of the script with the other imports

#robust_scaler = RobustScaler()
robust_scaler = RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)

X = robust_scaler.fit_transform(X)
print("\n", "after robust_scaler.fit_transform(X), X is as follows:", "\n", X.shape)
9.2. Output
after robust_scaler.fit_transform(X), X is as follows:
(53940, 27)
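For reference, RobustScaler centers each column on its median and divides by its interquartile range (the 25th-75th percentile span set by quantile_range), which makes it less sensitive to outliers than ordinary z-scoring. The fitted statistics can be inspected after fit_transform; a small sketch:

# center_ holds the per-column medians, scale_ the interquartile ranges.
print(robust_scaler.center_[:3])  # medians of the first three columns
print(robust_scaler.scale_[:3])   # IQRs of the first three columns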
10. Create a dataset y with only the target column
10.1. Code
y = diamonds[target_name]
print("\n", "y", "\n", y.head(10))
10.2. Output
y
0    326
1    326
2    327
3    334
4    335
5    336
6    336
7    337
8    337
9    338
Name: price, dtype: int64
11. Create train and test data
11.1. Code
from sklearn.model_selection import train_test_split  # goes at the top with the other imports

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

print("\n", "X_train.shape", "\n", X_train.shape)
print("\n", "X_test.shape", "\n", X_test.shape)

print("\n", "y_train.shape", "\n", y_train.shape)
print("\n", "y_test.shape", "\n", y_test.shape)
11.2. Output
X_train.shape
(43152, 27)

X_test.shape
(10788, 27)

y_train.shape
(43152,)

y_test.shape
(10788,)
12. Create a dataframe to store the results of different models
12.1. Code
models = pd.DataFrame(index=['train_mse', 'test_mse'],
                      columns=['KNN', 'Bagging', 'RandomForest', 'Boosting'])

print("\n", "models", "\n", models)
12.2. Output
models
           KNN Bagging RandomForest Boosting
train_mse  NaN     NaN          NaN      NaN
test_mse   NaN     NaN          NaN      NaN
13. Create a KNN model
13.1. Code
print("\n", "import KNeighborsRegressor")
from sklearn.neighbors import KNeighborsRegressor

print("create instance of KNeighborsRegressor")
knn = KNeighborsRegressor(n_neighbors=20, weights='distance', metric='euclidean', n_jobs=-1)
print("knn", "\n", 50 * "-", "\n", knn)
13.2. Output
import KNeighborsRegressor
create instance of KNeighborsRegressor
knn
--------------------------------------------------
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='euclidean',
                    metric_params=None, n_jobs=-1, n_neighbors=20, p=2,
                    weights='distance')
14. Fit the model to the training data
14.1. Code
print("\n", "Use the training data to train the estimator")
knn.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)
14.2. Output
Use the training data to train the estimator

After training, X_train
--------------------------------------------------
[[ 0.015625   -0.06666667  0.         ...  0.          0.          0.        ]
 [ 2.046875    0.33333333  1.         ...  0.          0.          0.        ]
 [ 0.625       0.46666667  0.         ...  0.          0.          0.        ]
 ...
 [ 0.984375   -0.2         0.33333333 ...  0.          0.          0.        ]
 [-0.609375    0.33333333  1.         ...  1.          0.          0.        ]
 [ 0.3125      1.8         1.33333333 ...  0.          0.          0.        ]]

After training, y_train
--------------------------------------------------
51408     2370
25582    14426
8877      4484
17084     6811
35353      898
         ...
10213     4742
16253     6501
17352     6963
28967      435
4762      3689
Name: price, Length: 43152, dtype: int64
15. Record the model's results in the results matrix
15.1. Code
from sklearn.metrics import mean_squared_error  # goes at the top with the other imports

# Update the results matrix
models.loc['train_mse', 'KNN'] = mean_squared_error(y_pred=knn.predict(X_train),
                                                    y_true=y_train)

models.loc['test_mse', 'KNN'] = mean_squared_error(y_pred=knn.predict(X_test),
                                                   y_true=y_test)

print("\n", "models after evaluating", "\n", 50 * "-", "\n", models)
15.2. Output
models after evaluating
--------------------------------------------------
              KNN Bagging RandomForest Boosting
train_mse  78.503     NaN          NaN      NaN
test_mse   774504     NaN          NaN      NaN
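The huge gap between train_mse and test_mse is expected: with weights='distance', each training point is its own nearest neighbor at distance zero, so training predictions are almost exact. Also note that MSE is in squared dollars; taking the square root gives an error in dollars, as in this short sketch:

import numpy as np

# Convert the MSE entries to RMSE (in dollars) for readability.
print(np.sqrt(models.astype(float)))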
16. Implement a Bagging model
16.1. Code
print("\n", "import BaggingRegressor")
#from sklearn.neighbors import KNeighborsRegressor  # already imported
from sklearn.ensemble import BaggingRegressor

print("create instance of KNeighborsRegressor and BaggingRegressor")

# A separate KNN instance serves as the base estimator (without n_jobs,
# since the bagging ensemble itself is parallelized).
knn_for_bagging = KNeighborsRegressor(n_neighbors=20, weights='distance', metric='euclidean')

bagging = BaggingRegressor(base_estimator=knn_for_bagging, n_estimators=15, max_features=0.75,
                           random_state=55, n_jobs=-1)

print("knn_for_bagging", "\n", 50 * "-", "\n", knn_for_bagging)
print("bagging", "\n", 50 * "-", "\n", bagging)

print("\n", "Use the training data to train the estimator")
bagging.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

# Evaluate the model
models.loc['train_mse', 'Bagging'] = mean_squared_error(y_pred=bagging.predict(X_train),
                                                        y_true=y_train)

models.loc['test_mse', 'Bagging'] = mean_squared_error(y_pred=bagging.predict(X_test),
                                                       y_true=y_test)

print("\n", "models after evaluating", "\n", 50 * "-", "\n", models)

print("\n", 50 * "-", "\nProgram Over")
16.2. Output
import BaggingRegressor
create instance of KNeighborsRegressor and BaggingRegressor
knn_for_bagging
--------------------------------------------------
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='euclidean',
                    metric_params=None, n_jobs=None, n_neighbors=20, p=2,
                    weights='distance')
bagging
--------------------------------------------------
BaggingRegressor(base_estimator=KNeighborsRegressor(algorithm='auto',
                                                    leaf_size=30,
                                                    metric='euclidean',
                                                    metric_params=None,
                                                    n_jobs=None,
                                                    n_neighbors=20, p=2,
                                                    weights='distance'),
                 bootstrap=True, bootstrap_features=False, max_features=0.75,
                 max_samples=1.0, n_estimators=15, n_jobs=-1, oob_score=False,
                 random_state=55, verbose=0, warm_start=False)

Use the training data to train the estimator

After training, X_train
--------------------------------------------------
[[ 0.015625   -0.06666667  0.         ...  0.          0.          0.        ]
 [ 2.046875    0.33333333  1.         ...  0.          0.          0.        ]
 [ 0.625       0.46666667  0.         ...  0.          0.          0.        ]
 ...
 [ 0.984375   -0.2         0.33333333 ...  0.          0.          0.        ]
 [-0.609375    0.33333333  1.         ...  1.          0.          0.        ]
 [ 0.3125      1.8         1.33333333 ...  0.          0.          0.        ]]

After training, y_train
--------------------------------------------------
51408     2370
25582    14426
8877      4484
17084     6811
35353      898
         ...
10213     4742
16253     6501
17352     6963
28967      435
4762      3689
Name: price, Length: 43152, dtype: int64

models after evaluating
--------------------------------------------------
              KNN Bagging RandomForest Boosting
train_mse  78.503  125735          NaN      NaN
test_mse   774504  752601          NaN      NaN
--------------------------------------------------
17. RandomForest model
Warning!!!
It takes 10-20 minutes.
17.1. Code
print("\n", "import RandomForestRegressor")
from sklearn.ensemble import RandomForestRegressor

RF = RandomForestRegressor(n_estimators=50, max_depth=16, random_state=55, n_jobs=-1)

print("RF", "\n", 50 * "-", "\n", RF)

print("\n", "Use the training data to train the estimator")
RF.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

models.loc['train_mse', 'RandomForest'] = mean_squared_error(y_pred=RF.predict(X_train),
                                                             y_true=y_train)

models.loc['test_mse', 'RandomForest'] = mean_squared_error(y_pred=RF.predict(X_test),
                                                            y_true=y_test)

print("\n", "models after evaluating", "\n", 50 * "-", "\n", models)
17.2. Output
import RandomForestRegressor
RF
--------------------------------------------------
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=16,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=50,
                      n_jobs=-1, oob_score=False, random_state=55, verbose=0,
                      warm_start=False)

Use the training data to train the estimator

After training, X_train
--------------------------------------------------
[[ 0.015625   -0.06666667  0.         ...  0.          0.          0.        ]
 [ 2.046875    0.33333333  1.         ...  0.          0.          0.        ]
 [ 0.625       0.46666667  0.         ...  0.          0.          0.        ]
 ...
 [ 0.984375   -0.2         0.33333333 ...  0.          0.          0.        ]
 [-0.609375    0.33333333  1.         ...  1.          0.          0.        ]
 [ 0.3125      1.8         1.33333333 ...  0.          0.          0.        ]]

After training, y_train
--------------------------------------------------
51408     2370
25582    14426
8877      4484
17084     6811
35353      898
         ...
10213     4742
16253     6501
17352     6963
28967      435
4762      3689
Name: price, Length: 43152, dtype: int64

models after evaluating
--------------------------------------------------
              KNN Bagging RandomForest Boosting
train_mse  78.503  125735       142396      NaN
test_mse   774504  752601       374999      NaN
--------------------------------------------------
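Random forests also expose per-feature importances, which help explain the much better test score here. A sketch, assuming the scaler preserved the column order printed in step 8 (it does not rename or reorder columns):

# Pair importances with the pre-scaling column names; duplicated cut_* labels
# from the step-7 note will simply appear twice in the index.
feature_names = diamonds.drop('price', axis=1).columns
importances = pd.Series(RF.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))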
18. Boosting model
Warning!!!
It takes 20-30 minutes.
18.1. Code
print("\n", "import AdaBoostRegressor")
from sklearn.ensemble import AdaBoostRegressor

print("create instance of AdaBoostRegressor")

boosting = AdaBoostRegressor(n_estimators=50, learning_rate=0.05, random_state=55)

print("boosting", "\n", 50 * "-", "\n", boosting)

print("\n", "Use the training data to train the estimator")
boosting.fit(X_train, y_train)
print("\n", "After training, X_train", "\n", 50 * "-", "\n", X_train)
print("\n", "After training, y_train", "\n", 50 * "-", "\n", y_train)

models.loc['train_mse', 'Boosting'] = mean_squared_error(y_pred=boosting.predict(X_train),
                                                          y_true=y_train)

models.loc['test_mse', 'Boosting'] = mean_squared_error(y_pred=boosting.predict(X_test),
                                                        y_true=y_test)

print("\n", "models after evaluating", "\n", 50 * "-", "\n", models)
18.2. Output
import AdaBoostRegressor
create instance of AdaBoostRegressor
boosting
--------------------------------------------------
AdaBoostRegressor(base_estimator=None, learning_rate=0.05, loss='linear',
                  n_estimators=50, random_state=55)

Use the training data to train the estimator

After training, X_train
--------------------------------------------------
[[ 0.015625   -0.06666667  0.         ...  0.          0.          0.        ]
 [ 2.046875    0.33333333  1.         ...  0.          0.          0.        ]
 [ 0.625       0.46666667  0.         ...  0.          0.          0.        ]
 ...
 [ 0.984375   -0.2         0.33333333 ...  0.          0.          0.        ]
 [-0.609375    0.33333333  1.         ...  1.          0.          0.        ]
 [ 0.3125      1.8         1.33333333 ...  0.          0.          0.        ]]

After training, y_train
--------------------------------------------------
51408     2370
25582    14426
8877      4484
17084     6811
35353      898
         ...
10213     4742
16253     6501
17352     6963
28967      435
4762      3689
Name: price, Length: 43152, dtype: int64

models after evaluating
--------------------------------------------------
              KNN Bagging RandomForest     Boosting
train_mse  78.503  125735       142396  1.82036e+06
test_mse   774504  752601       374999  1.81305e+06
--------------------------------------------------
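The comparatively poor AdaBoost numbers are largely due to its default base estimator, a depth-3 decision tree, which is too shallow for this many features. A hedged variant boosting deeper trees (max_depth=8 is an illustrative choice; note that recent scikit-learn versions rename the base_estimator parameter to estimator):

from sklearn.tree import DecisionTreeRegressor

# Hypothetical tweak: boost deeper trees instead of the depth-3 default.
boosting_deep = AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=8),
                                  n_estimators=50, learning_rate=0.05, random_state=55)
boosting_deep.fit(X_train, y_train)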
19. Save models to a file
Warning!!!
It takes 20-40 minutes (the time is for re-running the full script up to this point; the pickle itself is quick).
19.1. Code
import pickle

filename = 'output/all_models.sav'

print("\n", 50 * "-", "\nDumping models to", filename)

# Note: this pickles the `models` results DataFrame, not the fitted estimators.
pickle.dump(models, open(filename, 'wb'))
19.2. Output
models after evaluating
--------------------------------------------------
              KNN Bagging RandomForest     Boosting
train_mse  78.503  125735       142396  1.82036e+06
test_mse   774504  752601       374999  1.81305e+06
--------------------------------------------------
Dumping models to output/all_models.sav

--------------------------------------------------
Program Over
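Note that the pickle above stores the `models` results DataFrame, not the trained estimators themselves. To persist the fitted models for later reuse, one option is to dump them together in a dict; the file name below is an assumption:

# Sketch: persist the fitted estimators for later reuse.
with open('output/fitted_models.sav', 'wb') as f:
    pickle.dump({'knn': knn, 'bagging': bagging, 'RF': RF, 'boosting': boosting}, f)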