Translation: How to Become a Data Scientist and Think Like One

How to think like a data scientist to become one

How to become a data scientist and think like one

Original article:

http://www.kdnuggets.com/2017/03/think-like-data-scientist-become-one.html

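Translator:

Yunjie (云戒), author of 《全栈数据之门》
To make comparison easy, I have included the original text as well. The first draft of the translation came from Google Translate; I then corrected the passages where it failed to convey the meaning and added some notes.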


Original:

The author went from securities analyst to Head of Data Science at Amazon. He describes what he learned in his journey and gives 4 useful rules based on his experience.

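The author moved from securities analyst to Head of Data Science at Amazon. He describes what he learned along the way and offers four useful rules based on his experience. (Translator's note by Yunjie: "security" means safety, while "securities" means tradable financial instruments.)

Original: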

We have all read the headlines – data scientist is the sexiest job, there’s not enough of them and the salaries are very high. The role has been sold so well that the number of data science courses and college programs is growing like crazy. After my previous blog post I have received messages from people asking how to become a data scientist – which courses are the best, what steps to take, what is the fastest way to land a data science job?

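We have all seen the headlines: data scientist is the sexiest job, demand is huge, and the pay is high. Because demand is so strong, data science courses are multiplying like crazy, and universities are racing to launch programs of their own. After my previous blog post, I received a flood of questions about how to become a data scientist: which courses are best, what steps to take, and how to land a data science job as quickly as possible.

Original: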

I tried to really think it through and I reflected on my personal experience – how did I get here? How did I become a data scientist? Am I a data scientist? My experience has been very mixed – I have started out as a securities analyst in an investment house using mainly Excel. I then slowly shifted towards business intelligence in the banking industry and multiple consultancy projects, eventually doing the actual “data science” – building predictive models, working with Big Data, crunching tons of numbers and writing code to do data analysis and machine learning – what in the earlier days was called “data mining”.

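I tried to think the question through again in light of my own experience. How did I get to where I am? How did I become a data scientist? Am I even a data scientist? My background is quite mixed: I started out doing securities analysis in Excel, then gradually moved into business intelligence in banking and worked on several consulting projects. Almost without noticing, I found that what I was doing was in fact "data science": building predictive models, working with big data, crunching large volumes of numbers, and writing code for data analysis and machine learning, all of which used to be called "data mining".

Original: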

When the data science hype has started I tried to understand how is it different from what I have been doing so far. Should I learn new skills and become the data scientist instead of someone working “in analytics”?

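When the hype around data science began, I tried to understand how it differed from what I had been doing until then. Should I learn new skills and become a data scientist, rather than stay someone who works "in analytics"?

Original: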

Like everybody obsessed with it I have started taking multiple courses, reading data books, doing data science specializations (and not finishing them …), coded a lot – I wanted to become THE one in the middle cross-section of the (in)famous data science Venn diagram. The reality I learned is that these “Data Science” unicorns ( the legendary people in the center “Data Science” section) rarely exist and even if they do – they are typically generalists who have knowledge in all of these areas but are “master of none”.

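Like everyone obsessed with data science, I started taking multiple courses, reading data books, enrolling in data science specializations (without finishing them...), and writing a lot of code. I wanted to be one of the people in the central intersection of the data science Venn diagram shown in the title image. The reality, however, is that data science "unicorns" (the legendary people in that central section) are rare, and even when they exist, they are usually generalists across the areas in the diagram who have not mastered any one of them. (Translator's note by Yunjie: a generalist across data-related fields is what I call a full-stack talent.)

Original: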

Although I now consider myself a data scientist – I lead a fantastically talented data science team in Amazon, build machine learning models, work with “Big data” – I still think there’s too much chaos around the craft and much less clarity, especially for people thinking of switching careers. Don’t get me wrong – there are a lot of very complex branches of data science – like AI, robotics, computer vision, voice recognition etc. – which require very deep technical and mathematical expertise, and potentially a PhD… or two. But if you are interested in getting into a data science role that was called a business / data analyst just a few years ago – here are the four rules that have helped me get into and are still helping me survive in the data science.

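Admittedly, I now consider myself a data scientist. I lead an extremely talented data science team at Amazon, build machine learning models, and work with "big data" technology. Yet I still think the field is surrounded by confusion and is hard to explain clearly, especially to people thinking about changing careers. Don't get me wrong: data science has many very complex branches, such as AI, robotics, computer vision, and speech recognition, that demand deep technical and mathematical skill and possibly a PhD... or even two. But if you are interested in the kind of data science role that a few years ago was called business analyst or data analyst, take a look at the four rules below. They helped me get into data science, and they are still helping me go further in it.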


Rule 1 – Get your priorities and motivations straight.

Rule 1: Know your strengths and aim straight at your goal.

Original:

Be very realistic about what skills you have right now and where you want to arrive – there are so many different types of roles in data science, it’s important to understand and assess your current knowledge base. Let’s say you’re working in HR and want to change careers – learn about HR analytics! If you’re a lawyer – understand the data applications in the legal industry. The fact is that the hunger for insights is so big that all industries and business functions have started using it. If you already have a job then try to understand what can be optimized or solved by using data and learn how to do it yourself. It’s going to be gradual and long shift but you will still have a job and learn by doing it in the real world. If you are a recent graduate or a student – you have a perfect chance to figure out what are you passionate about – maybe movies, maybe music, or maybe cars? You wouldn’t imagine the amount of data scientists these industries employ – and they are all crazy about the fields they’re working in.

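Be very realistic about the skills you have now and the goal you want to reach. Data science covers many different kinds of roles, so it is important to understand and assess your current knowledge. Suppose you work in HR and want to change careers: learn HR analytics! If you are a lawyer, learn about the applications of data in the legal industry.

The fact is that the opportunity is huge, and every industry and business function has started using data science. If you already have a job, try to use data to optimize or solve a real problem there, and learn how to do it yourself. The shift will be gradual and long, with small gains accumulating into a real change, but you keep your job and learn while you work. If you are a student or a recent graduate, you have a great chance to find an industry you are passionate about: maybe movies, maybe music, maybe cars? These industries employ far more data scientists than you would imagine, and those data scientists are all crazy about the fields they work in.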


Rule 2 – Learn the basics very well.

Rule 2: Learn the basic skills well.

Original:

Although the specifics of the each data science field are very different, the basics are the same. There are three areas where you should develop a strong foundation – basic data analysis, introductory statistics and coding.


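Although the details of each data science field differ greatly, the foundations are the same. You should build a solid base in three areas: basic data analysis, introductory statistics, and coding.

Original: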

Data analysis. You should understand and practice (a lot!) the basic data analysis techniques – what is a table, how to join two tables, what are the main techniques to analyze data organized in such way, how to build summary views on your dataset and draw initial conclusions from it, what is the exploratory data analysis, which visualizations help you understand and learn from data. This is very basic but believe me – master this you’ll have the fundamental skill that is absolutely mandatory for the job.


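Data analysis. You should understand, and practice a lot, the basic data analysis techniques: what a table is, how to join two tables, what the main techniques are for analyzing data organized this way, how to build summary views of a dataset and draw initial conclusions from them, what exploratory data analysis is, and which visualizations help you understand and learn from the data. This is very basic, but believe me, mastering these techniques will serve you very well in your work.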
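To make "joining two tables" and "building a summary view" concrete, here is a minimal sketch in plain Python. The two tables, their column names, and the helper functions `inner_join` and `summarize` are all hypothetical and invented for illustration. In practice you would more likely use SQL or a library such as pandas, as the author suggests later.

```python
from collections import defaultdict

# Hypothetical data: two small "tables" represented as lists of rows.
customers = [
    {"customer_id": 1, "city": "Vilnius"},
    {"customer_id": 2, "city": "Berlin"},
    {"customer_id": 3, "city": "Berlin"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 2, "amount": 35.0},
    {"order_id": 12, "customer_id": 2, "amount": 15.0},
    {"order_id": 13, "customer_id": 3, "amount": 50.0},
]


def inner_join(left, right, key):
    """Join two lists of dict rows on a shared key (inner join)."""
    # Index the right-hand table by key so each lookup is cheap.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


def summarize(rows, group_key, value_key):
    """Build a summary view: count, total and mean of value_key per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        group: {
            "count": len(values),
            "total": sum(values),
            "mean": sum(values) / len(values),
        }
        for group, values in groups.items()
    }


# Attach each order to its customer's city, then summarize spending by city.
joined_rows = inner_join(orders, customers, "customer_id")
summary_by_city = summarize(joined_rows, "city", "amount")
for city, stats in sorted(summary_by_city.items()):
    print(city, stats)
```

The summary is the "initial conclusion" step: once orders carry a city, a count, total, and mean per city are often enough to suggest the next question to ask.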

Original:

Statistics. Also, get a very good grasp of introductory statistics – what is mean, median, when to use one over the other, what is a standard deviation and when it doesn’t make any sense to use it, why averages “lie” but are still the most used aggregated value everywhere, etc… When I say “introductory” I really mean “introductory”. Unless you are a mathematician and plan to become an econometrician who applies advanced statistical and econometric models to explain complex phenomenons – then yes, learn the advanced statistics. If you don’t have PhD in mathematics, just take your time and be patient and get a really good grasp of the basic statistics and probability.

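Statistics. In addition, you need a firm grasp of introductory statistics: what the mean and the median are and when to use one over the other; what a standard deviation is and when it makes no sense to use it; why averages often "lie" to us yet remain the most widely used aggregate. When I say "introductory", I really do mean introductory. The exception is a mathematician who plans to become an econometrician and apply advanced statistical and econometric models to explain complex phenomena; in that case, yes, learn advanced statistics.
If you do not have a PhD in mathematics, just take your time, be patient, and master basic statistics and the fundamentals of probability.
Translator's note by Yunjie: the "introductory" statistics the author refers to is what is usually called descriptive statistics.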
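As a small illustration of why averages can "lie", the sketch below compares the mean and the median on a made-up list of salaries that contains one extreme value. The numbers are invented for this example and do not come from the article. It uses only Python's standard `statistics` module.

```python
import statistics

# Made-up monthly salaries: nine typical values and one extreme outlier.
salaries = [2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 50000]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # middle of the sorted values
stdev_salary = statistics.stdev(salaries)    # sample standard deviation

# How many people actually earn less than the "average" salary?
below_mean = sum(1 for s in salaries if s < mean_salary)

print("mean:  ", mean_salary)
print("median:", median_salary)
print("stdev: ", round(stdev_salary, 1))
print("earning below the mean:", below_mean, "of", len(salaries))
```

Here the mean is 7160 while the median is 2450, and nine of the ten people earn less than the mean. The standard deviation comes out larger than the mean itself, a hint that it is not a meaningful description of a distribution this skewed.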

Original:

Coding. And of course – learn how to code. This is the most over-used cliché advice but it’s actually a sound one. You should start from learning how to query a database with SQL first – believe it or not, most of the time data science teams spend are on data pulling and preparation, and a lot of it is done with SQL. So get your basics in place – build your own small database, write some “select * from my_table” lines and get a good grasp of the SQL fundamentals. You should also learn one (start with just one) data analysis language – be it R or Python. Both are great and knowing them does make a difference since many (although not all) positions require them. First learn the basics of the language you chose (quick tip – start from learning dplyr with ggplot2 packages for R, or pandas with Seaborn libraries for Python) and learn how to do data analysis with it. You don’t have to become a programmer to succeed in the field, it’s all about knowing how to use the language to do data analysis – you won’t have to become a world-class hacker to land a data science job.


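Coding. This one is a must: learn how to write code. It is clichéd advice, but it is actually good advice. Start by learning how to query a database with SQL. Believe it or not, data science teams spend most of their time pulling and preparing data, and much of that is done in SQL. So get your basics in place: build your own small database, write a few "select * from my_table" statements, and get a solid grasp of SQL fundamentals. You should also learn one data analysis language (start with just one), either R or Python. Both are great, and knowing them makes a difference, because many positions, though not all, require them. First learn the basics of the language you chose (quick tip: start with the dplyr and ggplot2 packages in R, or the pandas and Seaborn libraries in Python), and then learn how to do data analysis with it.
You do not need to become an expert programmer. It is all about knowing how to use the language for data analysis; you do not have to be a world-class hacker to land a data science job.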
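One low-friction way to follow the "build your own small database" advice is Python's built-in `sqlite3` module, which needs no server. The sketch below is only an illustration: the table name `my_table` comes from the article, while the columns and rows are invented for this example.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows, made up for illustration.
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, product TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO my_table (product, price) VALUES (?, ?)",
    [("notebook", 3.5), ("pen", 1.2), ("backpack", 25.0)],
)
conn.commit()

# The query quoted in the article: fetch every row and column.
all_rows = conn.execute("SELECT * FROM my_table").fetchall()

# A slightly more useful query: filter and sort.
cheap_products = conn.execute(
    "SELECT product FROM my_table WHERE price < ? ORDER BY price", (5.0,)
).fetchall()

# An aggregate: the average price across all rows.
avg_price = conn.execute("SELECT AVG(price) FROM my_table").fetchone()[0]

conn.close()

print(all_rows)
print(cheap_products)
print(avg_price)
```

The `?` placeholders pass values separately from the SQL text, which is a good habit to form early even in throwaway practice scripts.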


Rule 3 – Data science is about solving problems – find and solve one.

Rule 3: Data science is about solving problems, so find one and solve it.

Original:

One thing I have learned over the years is that one of the fundamental requirements for a data scientist is to be always asking questions and looking for problems. Now I don’t advise to do it 24/7 as you will definitely go insane, but be prepared to be the problem solver and look for the problems non-stop. You will be amazed how much available data is out there – maybe you want to analyze your spending patterns, identify sentiment patterns of your emails, or just build nice charts to track your city’s finances. The data scientist is responsible for questioning everything – is this campaign effective, are there any concerning trends, maybe some products under-perform and should be taken off the market, does the discount make sense or is it too big – these questions become hypotheses that are then validated or rejected by the data scientist. They are the raw material and key to success in the job as the more of them you will solve – the better you’ll be in your job.

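One thing I have learned over the years is that one of the basic requirements for a data scientist is to keep asking questions and looking for answers. I do not recommend doing this around the clock, because you will certainly go crazy, but be ready to be the problem solver and to keep looking for problems.
You will be amazed how much data is available out there: maybe you want to analyze your spending patterns, detect the sentiment of your emails, or just build nice charts to track your city's finances. A data scientist is responsible for questioning everything: is this campaign effective, are there any worrying trends, are some products underperforming and due to be pulled from the market, does the discount make sense or is it too large? These questions become hypotheses that the data scientist then confirms or rejects.
They are the raw material of success in the job: the more of them you solve, the better you will be at it.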
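To show how a question such as "is this campaign effective?" can become a hypothesis that data confirms or rejects, here is a minimal sketch of a one-sided permutation test. The article does not prescribe this method; it is just one simple option. The daily sales figures and the function name `permutation_p_value` are invented for illustration.

```python
import random

# Made-up daily sales before and during a hypothetical campaign.
before = [100, 98, 105, 110, 95, 102, 99, 101]
during = [108, 112, 104, 115, 109, 111, 107, 113]


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(control, treated, n_permutations=10_000, seed=0):
    """Estimate how often a random relabeling of the days produces a lift
    at least as large as the observed one (one-sided permutation test)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mean(treated) - mean(control)
    pooled = list(control) + list(treated)
    n_control = len(control)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if diff >= observed:
            at_least_as_large += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)


observed_lift, p_value = permutation_p_value(before, during)
print("observed lift in mean daily sales:", observed_lift)
print("estimated p-value:", p_value)
```

A small p-value means the observed lift would rarely appear if the campaign had no effect, so the "no effect" hypothesis is rejected. A large one means the data cannot distinguish the campaign from chance. Either way, the question has been turned into something the data can answer.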


Rule 4 – Start doing instead of planning what you will do “when”.

Rule 4: Start doing, instead of endlessly planning what you will do "when".

Original:

This is applicable to any learning behavior but it’s especially true in data science. Be sure you start “doing” from the very first day you start learning. It’s very easy to put off the actual learning by just reading “about” data science, how it “should” be done, copy-pasting data analysis code from the book and running it on very simple datasets which you will never ever get in the real world.

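This applies to any kind of learning, but it is especially true in data science. Make sure you start "doing" from the very first day you start learning. It is easy to put off real learning by reading "about" data science and how it "should" be done, or by copying data analysis code from a book and running it on very simple datasets, and in the end you learn nothing.

Original: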

With everything you learn – be sure you start applying it to the field you’re passionate about. That’s where the magic happens – writing your first line of code and seeing it fail, being stuck and not knowing what to do next, looking for an answer, finding a lot of different solutions none of which work, struggling to build your own one and finally passing a milestone – the “aha!” moment. This is where the actual learning happens. Learning by doing is the only way to learn data science – you don’t learn how to ride bike by reading about it, right? Same thing applies here – whatever you learn, be sure you apply it immediately and solve actual problems with real data.

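Whatever you learn, make sure you start applying it to the field you love. That is where the magic happens: writing your first line of code and watching it fail, getting stuck and not knowing what to do next, looking for an answer, finding many different solutions none of which work, struggling until you finally have one that does, and at last reaching a milestone of your own, the "aha!" moment. This is where real learning happens. Learning by doing is the only way to learn data science; you would not learn to ride a bike by reading about it, would you? The same applies here: whatever you learn, be sure to apply it immediately and solve actual problems with real data.

Original: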

“If you spend too much time thinking about a thing, you’ll never get it done.” – a quote from the one of the most famous martial artist Bruce Lee captures the essence of this post. You have to apply what you learn and make sure you make your own mistakes.

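"If you spend too much time thinking about a thing, you'll never get it done." This quote from the world-famous martial arts master Bruce Lee captures the essence of this post. You have to apply what you learn and make sure you make your own mistakes.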


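Translator's summary of the key points (Yunjie):

  • You do not need formal training in the field to do data science; the original author also started with Excel and made the transition step by step.

  • Do not be dazzled by new names. Data science, machine learning, deep learning, data mining, big data: their foundations all come from data analysis.

  • As the original author says, data science needs full-stack talent. Please have a look at the translator Yunjie's new book 《全栈数据之门》, in stock on JD.com (京东), Dangdang (当当), and Tmall (天猫).

  • Coding is a basic skill, and learning R or Python well is especially important. So if you are coming over from programming, please keep at it; you already have a good foundation.

  • There are thousands of ways in. What matters is using a data mindset to solve the problems you run into; keep accumulating day after day, and eventually you will have a sky full of stars.

  • Do things hands-on, and when you hit a problem, try to solve it. The problem is your opponent: beat it with Jeet Kune Do, as Bruce Lee would, and you have succeeded.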


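This article was written by 全栈数据 and is licensed under the Creative Commons Attribution-ShareAlike 3.0 China Mainland license.
Before reposting or quoting, contact the author, credit the author, and cite the source of the article.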