Can someone explain Data Mining?
I've heard about this field of study under computer science called Data mining. I wanted to know exactly what it was and how difficult it is. #computer-software #cyber-security
4 answers
Hagen’s Answer
Hello Ellijah,
"Data mining" can mean a lot of different things, but generally it refers to the process of surfacing insights latent in the data. Sometimes those insights lie pretty close to the surface and could even be exposed by querying a database with SQL or filtering a spreadsheet. Tableau is a visualization tool into which you can load spreadsheets or DB tables and then create graphs or geoplots. Sometimes that's all you need (e.g. you just want to understand when and where sales are trending up or down and what products or sales teams are driving those trends).
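To make the "query a database with SQL" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and sales figures are all made up for illustration; a real sales database would look different.

```python
import sqlite3

# Hypothetical sales records: (month, region, amount). All values are invented.
rows = [
    ("2024-01", "East", 100.0),
    ("2024-01", "West", 80.0),
    ("2024-02", "East", 120.0),
    ("2024-02", "East", 30.0),
    ("2024-02", "West", 70.0),
]

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Total sales per region per month, ordered so a trend is easy to read off.
totals = conn.execute(
    "SELECT region, month, SUM(amount) FROM sales "
    "GROUP BY region, month ORDER BY region, month"
).fetchall()

for region, month, total in totals:
    print(region, month, total)
```

Reading the output region by region shows which regions are trending up or down, which is often all the "mining" a simple question needs.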
More sophisticated data mining entails creating models. Models are cool because they become a baseline or norm against which you evaluate new data or identify exceptions. Zillow has created pricing models based on a home's square footage, number of rooms, neighborhood, etc. That allows people to estimate the sale price of their own home or evaluate the price someone is asking. Zillow also predicts price trends so that they can estimate what the price will be one year or five years out. That's all based, more or less, on what they know today (their existing models), what's happening in the market today (which informs and potentially improves those models), and the impact of historical trends on future prices.
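As a rough illustration of what a pricing model can look like, here is a sketch that fits a straight line (ordinary least squares) to a handful of invented home sales and uses it to estimate a price. This is not how Zillow actually does it; their models use many more features. The numbers and the helper name fit_line are assumptions made up for the example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented training data: square footage and sale price of past sales.
square_feet = [1000, 1500, 2000, 2500]
sale_price = [200_000, 290_000, 410_000, 500_000]

slope, intercept = fit_line(square_feet, sale_price)

# Use the model as a baseline for a home that has not sold yet.
estimate = slope * 1800 + intercept
print(f"Estimated price for 1800 sq ft: ${estimate:,.0f}")
```

Once you have the line, a listing priced far above or below the estimate stands out as an exception worth a closer look.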
Finally, statistics provides the mathematical tools that help substantiate the level of confidence we can associate with a given model. The standard "bell curve," in which most of the data points cluster around the peak in the middle, can be divided into standard deviations that allow us to say "we can be very confident in the data in the middle" and "less and less confident in data as it moves down and out" (traversing standard deviation boundaries). Hence, as data deviates from the norm, it becomes harder to predict, and there is more risk when we try to predict what might happen. With a good statistical model we can associate a percentage of risk with a prediction based on where it lies on the curve. So an airline can oversell the seats on a flight knowing there is a 3% chance they'll get it wrong and have to pay a customer to bounce them to a later flight. That 3% risk factor makes it a lot easier to make a business decision (e.g. it costs $300 to bounce a customer, but we typically have 4 no-shows, which costs us $1200, so overselling 3 or 4 seats is a rational bet to place).
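The overselling bet can be sketched with a simple binomial model. The $300 bounce cost comes from the example above; the 100-seat plane, the 96% show-up rate, and the assumption that passengers show up independently of each other are all made up for illustration, so the percentages it prints are not real airline figures.

```python
from math import comb

def show_up_probability(tickets_sold, shows, show_prob):
    """Binomial probability that exactly `shows` ticket holders turn up."""
    return (comb(tickets_sold, shows)
            * show_prob ** shows
            * (1 - show_prob) ** (tickets_sold - shows))

def prob_overflow(tickets_sold, seats, show_prob):
    """Probability that more passengers turn up than there are seats."""
    return sum(show_up_probability(tickets_sold, k, show_prob)
               for k in range(seats + 1, tickets_sold + 1))

def expected_bump_cost(tickets_sold, seats, show_prob, cost_per_bump):
    """Average amount paid out to bounced passengers per flight."""
    return sum((k - seats) * cost_per_bump
               * show_up_probability(tickets_sold, k, show_prob)
               for k in range(seats + 1, tickets_sold + 1))

SEATS = 100        # assumed plane size
SHOW_PROB = 0.96   # assumed: about 4 no-shows per 100 tickets
BUMP_COST = 300    # dollars, from the example above

for extra in range(5):
    sold = SEATS + extra
    risk = prob_overflow(sold, SEATS, SHOW_PROB)
    cost = expected_bump_cost(sold, SEATS, SHOW_PROB, BUMP_COST)
    print(f"oversell by {extra}: risk {risk:.1%}, expected bump cost ${cost:.2f}")
```

Comparing the expected bump cost with the revenue from each extra ticket is what turns "there is some risk" into a decision you can defend.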
Hope that answers your question,
Best,
Hagen
Rob’s Answer
Lisa’s Answer
Data mining has to do with using mathematical software to manipulate large amounts of data and then analyze it. Data at that scale is often called Big Data. The results are then used to make business decisions. The people who do this work are called data analysts. Think about researching something on Google.com. Or consider the fact that your online purchases are now tracked and later used to display images and advertisements of similar products on your Facebook page. This is the result of data mining and analysis. People who do this are typically good at programming and often graduate with math majors.
Eric’s Answer
The other answers are good, and I can share my personal experience.
You may, in school, have looked over a table of data and tried to make a decision. With, say, 10 data points, it's pretty easy to get a feeling for them all and what they mean.
With more data, that can be harder. With a few hundred data points, you can make some graphs and try to get a sense of things, and look over a fair number of points by hand to see what you can learn.
When the data gets to be larger, like millions of data points, and if the data is more complicated, most of these methods break down, and you have to come up with new methods. I was trying to learn things about every building in every city in America. There is no simple way to do that, and even if you select a random sample of them that is small enough to look at, you can't be sure that you haven't just seen a weird group which doesn't represent reality. In this case I had to come up with ways to group some of the buildings together and then plot the groups on a map of the country.
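One simple way to group buildings before plotting them is to drop each one into a grid cell by latitude and longitude and count how many land in each cell. This is only a sketch of the general idea, not the method I actually used; the coordinates and the one-degree cell size are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical building locations as (latitude, longitude) pairs.
buildings = [
    (40.71, -74.00),   # around New York
    (40.73, -73.99),
    (34.05, -118.24),  # around Los Angeles
    (34.10, -118.30),
    (41.88, -87.63),   # around Chicago
]

def grid_cell(lat, lon, size=1.0):
    """Return the (row, column) of the grid cell containing the point."""
    return (math.floor(lat / size), math.floor(lon / size))

# Count buildings per cell; each cell becomes one point on the map.
counts = Counter(grid_cell(lat, lon) for lat, lon in buildings)

for cell, n in sorted(counts.items()):
    print(cell, n)
```

With millions of buildings you end up plotting a few thousand cells instead, which is something you can actually look at.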
"Data mining" first implies that you have access to enormous amounts of data, which is sometimes harder than you think (and, as a Software Engineer, you should always think about whether you really need that much data in the first place! Collecting it has moral, ethical, and legal issues). Then, it involves the methods the other answerers gave to learn about it and not just be swamped.
Hope this helps!