@hspter
|
The way we talk about data science and focus so much on methods, we actually incentivize working with *bad* data, rather than spending the time to collect good data and then use easy methods with it
|
Hilary Parker
@hspter
|
Jan 31
|
I had a bit of a breakthrough in terms of my thinking of data science thanks to all the interesting discussions at #rstudioconf2020 --
|
Hilary Parker
@hspter
|
Jan 31
|
I want us to have a conference solely focused on how people collect data, and all the politics / product negotiations / etc that go along with that
|
Hilary Parker
@hspter
|
Jan 31
|
One small thing I am doing at Stitch Fix -- for the datasets we use, I'm referencing them as e.g. "the data that Cindy, Ping and Francesca created" rather than "the stylecard data".
|
Patrick Blanchenay
@PBlanchenay
|
Feb 1
|
This is a false dichotomy. In an ideal world, data collection informs the method, and the choice of method might influence data collection. And in reality, one usually has more choice over method than over data collection.
|
Hilary Parker
@hspter
|
Feb 2
|
I don't disagree that it is a false dichotomy. I think data scientists in general have much more influence than they might think, but it just takes more time and politics than most will tolerate
|
Ben Greve
@benjamingreve
|
Jan 31
|
Agreed. Spending time on dealing with data quality problems and calculating better, more relevant features will almost always contribute more to the predictive power of a model than tuning the model or increasing its complexity.
|
Cece🌊🌊🌊
@grl_must_ride
|
Jan 31
|
Part of the problem is data cleaning is seen as “less than”.
But I know the scientific decisions that are forfeited if you just let some low-pay lackey do it. Oh wait that’s me.
|
Arman Oganisian
@StableMarkets
|
Feb 1
|
I agree there's over-hype around cool methods and not enough thought about the data generating process that our sampling scheme should be capturing. But fancy methods are necessary too because accurately capturing the process is often infeasible (cost, ethical constraints, etc)
|
Hilary Parker
@hspter
|
Feb 1
|
Oh for sure, no doubt about that. But for many tech applications specifically, the juice is not worth the squeeze
|
Guy Maskall 🇪🇺 🔶 #FBPE
@GuyMaskall
|
Feb 1
|
That's starting to sound like putting "science" into "data science".
|
Hilary Parker
@hspter
|
Feb 1
|
it's funny bc I talked at a plant pathology conf recently and was like "I guess I don't have to tell you all to care about the underlying question"