Big thanks to Probspace for holding this competition, and congrats to all the top teams. I joined 5 days before the deadline and soon discovered that the data quality is very good and that my CV score consistently tracked the public LB (and, as it turned out, the private LB too). It was truly a great experience for me.
In this write-up, I'll briefly describe my solution and share some thoughts about the competition.
Currently, I am a Master's student at Tokyo Tech. For further information, please refer to my Kaggle Profile.
My final score (Public: 0.25848 / Private: 0.25854 / CV: 0.261616) comes from a single LightGBM model (5-fold bagging) trained on the top 700 features selected by LightGBM feature importance. I only used LightGBM here since it is good at dealing with tabular data and relatively fast compared to XGBoost and CatBoost, which lets you test more ideas in limited time. For the validation scheme, I simply used 5-fold cross-validation and it worked very well: the CV score always aligned with the LB score.
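As a rough illustration of the 5-fold scheme, here is a minimal base R sketch that produces out-of-fold predictions for the CV score and averages the five fold models on the test set. It is not the competition code: the toy data and column names are made up, and lm() is only a stand-in for the LightGBM model.

```r
set.seed(42)

# Toy data standing in for the competition tables (hypothetical columns).
n <- 200
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$y <- 1 + 2 * train$x1 - train$x2 + rnorm(n, sd = 0.1)
test <- data.frame(x1 = rnorm(50), x2 = rnorm(50))

k <- 5
# Randomly assign every training row to one of k folds of equal size.
fold_id <- sample(rep(seq_len(k), length.out = n))

oof <- numeric(n)                       # out-of-fold predictions for the CV score
test_pred <- matrix(0, nrow(test), k)   # one column of test predictions per fold model

for (f in seq_len(k)) {
  valid_rows <- fold_id == f
  # lm() is only a stand-in; the post trains LightGBM in this slot.
  fit <- lm(y ~ x1 + x2, data = train[!valid_rows, ])
  oof[valid_rows] <- predict(fit, newdata = train[valid_rows, ])
  test_pred[, f] <- predict(fit, newdata = test)
}

cv_rmse <- sqrt(mean((train$y - oof)^2))   # CV score from out-of-fold predictions
bagged_test_pred <- rowMeans(test_pred)    # average of the k fold models
```

Because every training row is predicted by a model that never saw it, cv_rmse is an honest estimate, which is what makes it comparable to the leaderboard score.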
The data is a little bit dirty, but compared to the data from the Signate Student Cup 2019, it was not a problem at all for me. I just spent some time converting the text from full-width (全角) to half-width (半角) characters, then splitting combined fields into single features so that we can do feature engineering on them.
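The full-width to half-width step could look roughly like the base R sketch below. The helper name to_halfwidth is hypothetical, and it assumes only the full-width ASCII variants (digits, Latin letters, punctuation) and the ideographic space need converting; the post does not say exactly which characters were handled.

```r
# Convert full-width ASCII variants (U+FF01..U+FF5E) and the ideographic
# space (U+3000) to their half-width counterparts. Kana are left untouched.
to_halfwidth <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(s)                       # string -> Unicode code points
    fw <- cp >= 0xFF01L & cp <= 0xFF5EL      # full-width ASCII block
    cp[fw] <- cp[fw] - 0xFEE0L               # fixed offset to plain ASCII
    cp[cp == 0x3000L] <- 0x20L               # ideographic space -> space
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

# Full-width "123" written with Unicode escapes to avoid encoding issues.
to_halfwidth("\uff11\uff12\uff13")
```

After this normalization, numeric strings can be parsed with as.numeric() and combined fields can be split on ordinary ASCII delimiters.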
My whole FE is composed of 6 parts :
Group method (numeric2cate): Apply statistics of numeric features within groups defined by different categorical features. For example, applying "mean" on "面積（㎡）" (area in m²) grouped by "市区町村コード" (municipality code).
The statistics functions I used :
Group method (cate2cate): Apply statistics of categorical features within groups defined by other categorical features. For example, applying "entropy" on the frequency table of "最寄駅：名称" (nearest station name) grouped by "市区町村コード" (municipality code).
The statistics functions I used :
Target encoding : Reference and code
Count encoding: This works very well on some categorical features like "取引の事情等" (transaction circumstances).
Features from land_price.csv: Built with the two "Group method" variants described above. I apply the statistics to features grouped by "所在地コード" (location code), then merge the result into the train+test data.
Feature pseudo-labeling: Build a LightGBM model to predict each of the important features (I used "sq_meter", "land__mean__ON__h31_price", "nobeyuka_m2", "Age of house", "time_to_nearest_aki"), then take the out-of-fold (OOF) predictions.
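Three of the parts above (the two group methods and count encoding) can be sketched in base R as follows. This is only an illustration on a toy table: the English column names are hypothetical stand-ins for the Japanese ones, and Shannon entropy with the natural log is an assumption about what "entropy" means here.

```r
# Toy table; the column names are hypothetical English stand-ins.
df <- data.frame(
  city_code = c("A", "A", "A", "B", "B"),
  station   = c("s1", "s1", "s2", "s3", "s3"),
  area_m2   = c(50, 70, 90, 100, 120),
  stringsAsFactors = FALSE
)

# numeric2cate: mean of a numeric column within each categorical group.
df$area_mean_by_city <- ave(df$area_m2, df$city_code, FUN = mean)

# cate2cate: Shannon entropy of one categorical column's frequency table
# within each group of another categorical column.
shannon_entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]              # guard against unused factor levels
  -sum(p * log(p))
}
entropy_by_city <- tapply(df$station, df$city_code, shannon_entropy)
df$station_entropy_by_city <- as.numeric(entropy_by_city[df$city_code])

# Count encoding: replace each category by how often it occurs.
station_counts <- table(df$station)
df$station_count <- as.integer(station_counts[df$station])
```

In the competition setting the statistics would be computed on the stacked train+test table, so that both sets share the same group values.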
Surprisingly, tuning "alpha" in Huber loss gave me a really big boost (~0.001). In Huber loss, alpha is the threshold on the residual: errors smaller than alpha are penalized quadratically (like squared loss), and errors larger than alpha are penalized only linearly (like absolute loss). So lowering alpha makes the model less sensitive to those "outlier cases".
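lgb_param <- list(boosting_type = 'gbdt',
                  objective = "huber",            # Huber loss; its threshold is `alpha` below
                  boost_from_average = 'false',
                  metric = "none",                # disable built-in metrics
                  learning_rate = 0.008,
                  num_leaves = 128,
                  # min_gain_to_split = 0.01,
                  feature_fraction = 0.05,        # each tree samples 5% of the features
                  # feature_fraction_seed = 666666,
                  bagging_freq = 1,
                  bagging_fraction = 1,           # 1 = use all rows (no row subsampling)
                  min_sum_hessian_in_leaf = 5,
                  # min_data_in_leaf = 100,
                  lambda_l1 = 0,                  # no L1 regularization
                  lambda_l2 = 0,                  # no L2 regularization
                  alpha = 0.3)                    # lowered from LightGBM's default of 0.9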
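To see how alpha controls the influence of large errors, here is a small base R sketch of the textbook Huber loss and its gradient, with alpha as the threshold. It is an illustration under that standard definition, not LightGBM's internal code, and the residual values are made up.

```r
# Textbook Huber loss on residual r with threshold alpha:
# quadratic for |r| <= alpha, linear beyond it.
huber_loss <- function(r, alpha) {
  ifelse(abs(r) <= alpha, 0.5 * r^2, alpha * (abs(r) - 0.5 * alpha))
}

# Its gradient is the residual clipped to [-alpha, alpha].
huber_grad <- function(r, alpha) pmax(pmin(r, alpha), -alpha)

residuals <- c(0.1, 0.5, 3.0)        # the last one plays the role of an outlier
huber_grad(residuals, alpha = 0.9)   # 0.1 0.5 0.9
huber_grad(residuals, alpha = 0.3)   # 0.1 0.3 0.3
```

With alpha = 0.3 the outlier pulls on the model no harder than a residual of 0.3 would, whereas with alpha = 0.9 it pulls three times as hard.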
If you have any questions, please feel free to ask me! Questions in Japanese are fine too :))
Congratulations! I learned a lot from your work. Thanks for sharing.
A couple of questions from me if you don't mind:
1) How did you end up using Huber loss? Because of outliers?
2) I'm astonished by the diversity of your engineered features. It was especially wise of you to come up with so many aggregation features by "市区町村コード" (municipality code). Were there any particularly strong features?
3) Related to 2), how effective were features generated from 'Feature pseudo-labeling'? I've never tried that strategy and am very interested in its effectiveness.
4) What was your imputation strategy in this competition?
5) There was a so-called 'leak' in this competition (https://prob.space/competitions/re_real_estate_2020/discussions/masato8823-Post9982d5b9dcd6a33111e0). Did you take advantage of it in any way?
Presumably some of my questions can be self-answered from your code, but my R proficiency is very limited so...I'm sorry but I would appreciate your kind reply. Thanks in advance!
Hi katsu-san, thank you for your kind words :))
About the questions :
1) I started my modeling with the typical RMSE as my loss function, and then I tried Fair loss and Huber loss. I used Huber loss in my final solution because it gave me a better CV score. The default setting in LightGBM is alpha=0.9, so I thought that lowering the alpha value could make my model more robust, and in hindsight, it really did.
2) I didn't test those features one by one against the CV score, so I can't tell you which features are effective (boost CV a lot). But some features like "x_zscore__ON__nobeyuka_m2__BY__trans_date_yr" or "bayes_mean__ON__sq_meter__BY__nearest_aki_name" have really high "GAIN" in the LightGBM importance.
3) The feature pseudo-labeling features gave me a ~0.001 boost on CV.
4) None. LightGBM handles missing values very well on its own.
5) No, I didn't know about this. It looks very interesting.
Feel free to ask me if you have any other questions :))
Congratulations!! And thank you for sharing!!
I have two questions if you don't mind.
1) Am I correct in understanding that you did not use feature selection? Did you lower LightGBM's feature fraction instead of doing feature selection? If so, is that better than feature selection?
2) Is there a reason that both lambda L1 and L2 of LightGBM are zero?
Hi pao-san :)
1) I did use feature selection in this competition, and I forgot to mention it in the post, really sorry for that. In my final submission, I first ran a model on the full dataset (~1600 features), then selected the top 700 features and reran it. I'm surprised that it only gave me a ~0.0003 improvement. As for the feature fraction, whether I used the full dataset or the top 700, a value of 0.05 always gave me the best CV score.
2) Nope, just because those values gave me a better CV score.
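The two-pass selection described in this reply can be sketched in base R as below. The importance table is made up and only mimics the Feature and Gain columns that lightgbm's lgb.importance() returns; ranking by Gain and the helper name select_top_features are assumptions, not details confirmed in the post.

```r
# Hypothetical importance table with made-up values, shaped like the
# Feature/Gain columns of lgb.importance() output.
importance <- data.frame(
  Feature = c("f_a", "f_b", "f_c", "f_d", "f_e"),
  Gain    = c(0.05, 0.40, 0.10, 0.30, 0.15),
  stringsAsFactors = FALSE
)

# Rank features by Gain (an assumption) and keep the k best.
select_top_features <- function(importance, k) {
  ranked <- importance[order(importance$Gain, decreasing = TRUE), ]
  head(ranked$Feature, k)
}

top_features <- select_top_features(importance, k = 3)   # the post uses k = 700
# A second model would then be trained on train[, top_features] only.
```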
Congratulations! I am so grateful for your detailed sharing. It is very educational for me. Just reading this topic makes me fully satisfied with this competition.
Thank you amedama-san !!